Bug 1190632 - Failure to reboot guests consistently at least 100 times
Summary: Failure to reboot guests consistently at least 100 times
Status: REOPENED
Duplicates: 1190633 (view as bug list)
Alias: None
Product: openSUSE Tumbleweed
Classification: openSUSE
Component: Basesystem
Version: Current
Hardware: Other
OS: Other
Priority: P5 - None
Severity: Normal
Target Milestone: ---
Assignee: Dan Čermák
QA Contact: E-mail List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-09-17 22:47 UTC by Luis Chamberlain
Modified: 2022-05-27 18:27 UTC
CC List: 6 users

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
reboot test (1.73 KB, text/plain)
2021-09-27 20:35 UTC, James Fehlig
Details

Description Luis Chamberlain 2021-09-17 22:47:46 UTC
Upstream kdevops [0] has a workflow test called "reboot-limit". The goal is to spawn a guest, then reboot it and confirm each reboot, at least 100 times; if any reboot fails, the test fails. On top of this, the real goal is to run this simple test 100 times, for a full set of 10,000 reboots. This is called achieving steady state for the reboot-limit kdevops test.

To spawn guests, vagrant is used. Right now I'm using vagrant with libvirt, so behind the scenes this is all libvirt. I've been following Tumbleweed for a while now, and across all releases so far I've been able to complete a successful reboot-limit test only once, so it would seem that success might just be chance.

kdevops uses ansible to control guests. The reboot consists of 3 simple steps:

1) The ansible reboot module first sshes to the guest, queries /proc/sys/kernel/random/boot_id, and saves this random UUID.
2) Ansible sshes to the guest and issues '/sbin/shutdown -r 0' to reboot it.
3) Ansible then loops trying to ssh to the guest for up to 600 seconds; once it logs in, it queries /proc/sys/kernel/random/boot_id again. If the UUIDs match, the reboot is determined to have failed.
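
The three steps above map onto a single task using Ansible's reboot module. A minimal sketch (the task name is mine; reboot_timeout is shown explicitly at the module's documented default of 600 seconds):

```yaml
# Hypothetical minimal version of the reboot task; kdevops' actual
# task (see the diff later in this report) also sets become/when/tags.
- name: Reboot the guest and wait for a new boot_id
  become: yes
  reboot:
    reboot_timeout: 600   # seconds to wait for the guest to come back (module default)
```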

Now, /proc/sys/kernel/random/boot_id does not change in containers, so in ansible 2.10 the boot_time_command option was added to the reboot module to let you specify another check. But for guests, boot_id should work fine.
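
For the container case, the 2.10 override looks roughly like this (a sketch; the choice of `uptime -s` as the probe is my assumption — any command whose output changes across boots will do):

```yaml
# Sketch only: boot_time_command overrides the default probe
# (cat /proc/sys/kernel/random/boot_id) on Ansible >= 2.10.
- name: Reboot a container-hosted target with a custom boot-time probe
  become: yes
  reboot:
    boot_time_command: uptime -s   # assumed probe; output must change across boots
```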

openSUSE Tumbleweed is running ansible-2.9.24-1.1.noarch. With it, the reboot-limit test can't pass more than 100 reboots easily; sometimes it reaches 200 or 500, but that's it.

A workaround I noticed was to use:

diff --git a/playbooks/roles/reboot-limit/tasks/do-reboot.yml b/playbooks/roles/reboot-limit/tasks/do-reboot.yml
index a6876d2..2841bf6 100644
--- a/playbooks/roles/reboot-limit/tasks/do-reboot.yml
+++ b/playbooks/roles/reboot-limit/tasks/do-reboot.yml
@@ -12,6 +12,7 @@
   become: yes
   become_method: sudo
   reboot:
+    post_reboot_delay: 10
   when:
     - reboot_limit_test_type == "ansible"
   tags: [ 'run_tests' ]


I've had to merge the above fix into kdevops because otherwise this test fails easily... With it we are able to get 10,000 reboots on openSUSE Tumbleweed.

But on Debian testing, using ansible 2.10.7 and linux-next, I'm easily able to get 1,000,000 reboots, and without the above hack. Which raises the question: where is the issue?

I checked ansible's verbose output (ansible-playbook -vvv) for the test, and everything seems to check out, except that the timeout really does happen. But the odd thing is that you are still able to ssh to the host when you go and check what happened, which would seem to indicate the issue might be something else.

[0] https://github.com/mcgrof/kdevops
Comment 1 Andreas Stieger 2021-09-20 07:10:50 UTC
*** Bug 1190633 has been marked as a duplicate of this bug. ***
Comment 2 James Fehlig 2021-09-20 20:46:10 UTC
(In reply to Luis Chamberlain from comment #0)
> Upstream kdevops [0] has a workflow test called "reboot-limit". The goal is
> to spawn a guest and then reboot it and confirm each reboot at least 100
> times. If that fails the test fails. On to of this, the real goal is to run
> this simple test 100 times, so for a full set of 10,000 reboots. This is
> called achieving steady state for the reboot-limit kdevops test.
> 
> To spawn guests vagrant is used. Right now I'm using this to vagrant with
> libvirt, and so behind the scenes this is all libvirt. I've been following
> Tumbleweed for a while now and in all releases so far I've only been able to
> accomplish a successful reboot-limit test only once, and so it would seem it
> might just be chance.
> 
> kdevops uses ansible to control guests. There are 3 simple steps to the
> reboot:
> 
> 1) ansible reboot module will first ssh to the guest and query
> /proc/sys/kernel/random/boot_id and saves this random UUID.
> 2) Ansible will ssh to the guest and issue '/sbin/shutdown -r 0 ' to reboot
> the guest
> 3) Yet again ansible will try to loop trying to ssh to the guest for up to
> 600 seconds, and once it logs in it will query again
> /proc/sys/kernel/random/boot_id. If the UUIDs match then the reboot is
> determined to have failed.

I created a similar test that uses libvirt directly. It is running but has only completed 36 iterations.

> Now, using /proc/sys/kernel/random/boot_id does not change in containers,
> and so on ansible 2.10 boot_time_command was added to the reboot module to
> allow you to specify another check. But for guests, this should work fine.
> 
> OpenSUSE Tumbleweed is running ansible-2.9.24-1.1.noarch. With this the
> reboot-failure can't pass more than 100 reboots easily, sometimes maybe it
> reaches 200 or 500 but that's it.
> 
> A workaround I noticed was to use:
> 
> diff --git a/playbooks/roles/reboot-limit/tasks/do-reboot.yml
> b/playbooks/roles/reboot-limit/tasks/do-reboot.yml
> index a6876d2..2841bf6 100644
> --- a/playbooks/roles/reboot-limit/tasks/do-reboot.yml
> +++ b/playbooks/roles/reboot-limit/tasks/do-reboot.yml
> @@ -12,6 +12,7 @@
>    become: yes
>    become_method: sudo
>    reboot:
> +    post_reboot_delay: 10
>    when:
>      - reboot_limit_test_type == "ansible"
>    tags: [ 'run_tests' ]
> 
> 
> I've had to merge the above fix to kdevops because otherwise this test fails
> easily... With this we are able to get 10,000 reboots on OpenSUSE Tumbleweed.
> 
> But, on Debian testing using ansible 2.10.7 and linux-next, I'm easily able
> to get 1,000,0000 reboots and without the above hack. Which begs the
> question: where is the issue?

Hopefully my test will help sort that out. If it runs fine at the libvirt level, then I'd suspect something in ansible or vagrant-libvirt.

BTW, although TW has older ansible (and perhaps vagrant stuff) than Debian testing, it has a newer libvirt (7.7.0 vs 7.6.0) and qemu (6.1.0 vs 1.6.1).
Comment 3 Dan Čermák 2021-09-21 06:37:51 UTC
What is the actual issue at hand here? I am honestly puzzled as to what the problem actually is: do your VMs time out during reboot? Do you reboot them via vagrant? Can you reproduce this consistently?
Comment 4 James Fehlig 2021-09-21 15:23:13 UTC
(In reply to James Fehlig from comment #2)
> I created a similar test that uses libvirt directly. It is running but has
> only completed 36 iterations.

Update: The test has now completed 1680 iterations without encountering an issue.
Comment 5 Luis Chamberlain 2021-09-21 20:12:20 UTC
(In reply to Dan Čermák from comment #3)
> What is the actual issue at hand here? 

After a number of reboots using vagrant-libvirt on openSUSE Tumbleweed, ssh access times out (after more than 600 seconds, the ansible default).

> I am honestly puzzled what actually
> the problem is: do your VMs time out to reboot? Do you reboot them via
> vagrant?

No, ansible is used to reboot, as noted in the first comment. I also looked at how ansible does this; it runs: /sbin/shutdown -r 0

I also described, in comment #0, exactly the algorithm used to determine whether the reboot was successful or not.

>  Can you reproduce this consistently?

Yes, absolutely. At what reboot number? Well, it varies: sometimes within 10, sometimes within 80, sometimes after 600.

On Debian testing I have no issues and got up to 1,000,000 easily.
Comment 12 Dan Čermák 2021-09-22 06:35:01 UTC
(In reply to Luis Chamberlain from comment #5)
> (In reply to Dan Čermák from comment #3)
> > What is the actual issue at hand here? 
> 
> After a number of reboots using vagrant-libvirt on OpenSUSE Tumbleweed ssh
> access times out (over 600 seconds default is the ansible default).
> 
> > I am honestly puzzled what actually
> > the problem is: do your VMs time out to reboot? Do you reboot them via
> > vagrant?
> 
> No, ansible is used to reboot, as noted in the first comment. I also looked
> how ansible does this and it is: /sbin/shutdown -r 0
> 
> I also noted exactly the algorithm used on comment #1 about how it
> determines if the reboot was successful or not.
> 
> >  Can you reproduce this consistently?
> 
> Yes. Absolutely. At what reboot number? Well it varies, sometimes within 10,
> sometimes with 80, sometimes after 600.
> 
> On debian testing I have no issues and got up to 1,000,000 easy.

Could you export the environment variable `VAGRANT_LOG=DEBUG` during your tests, capture stdout & stderr and upload the last logfile when the box times out?

Also, could you please provide the version numbers of the respective components (vagrant, vagrant-libvirt, libvirt, the ruby gems, etc.) on openSUSE and on Debian?
Comment 14 Luis Chamberlain 2021-09-22 19:09:46 UTC
(In reply to Dan Čermák from comment #12)
> (In reply to Luis Chamberlain from comment #5)
> > (In reply to Dan Čermák from comment #3)
> > > What is the actual issue at hand here? 
> > 
> > After a number of reboots using vagrant-libvirt on OpenSUSE Tumbleweed ssh
> > access times out (over 600 seconds default is the ansible default).
> > 
> > > I am honestly puzzled what actually
> > > the problem is: do your VMs time out to reboot? Do you reboot them via
> > > vagrant?
> > 
> > No, ansible is used to reboot, as noted in the first comment. I also looked
> > how ansible does this and it is: /sbin/shutdown -r 0
> > 
> > I also noted exactly the algorithm used on comment #1 about how it
> > determines if the reboot was successful or not.
> > 
> > >  Can you reproduce this consistently?
> > 
> > Yes. Absolutely. At what reboot number? Well it varies, sometimes within 10,
> > sometimes with 80, sometimes after 600.
> > 
> > On debian testing I have no issues and got up to 1,000,000 easy.
> 
> Could you export the environment variable `VAGRANT_LOG=DEBUG` during your
> tests, capture stdout & stderr and upload the last logfile when the box
> times out?

There are no vagrant commands run for the reboot test; the provisioning all works fine. So should I still set this when I run the ansible command for the reboot-limit playbook?
 
> Also, could you please provide the version numbers of the respective
> components (vagrant, vagrant-libvirt, libvirt, the ruby gems, etc.) on
> openSUSE and on Debian?


OpenSUSE:

ruby3.0-rubygem-logging-2.3.0-1.5.x86_64
ruby3.0-rubygem-fog-json-1.2.0-1.8.x86_64
libruby3_0-3_0-3.0.1-3.1.x86_64
ruby2.7-2.7.3-3.1.x86_64
ruby-common-2.6-6.1.noarch
system-group-libvirt-20170617-23.1.noarch
yast2-ruby-bindings-4.4.4-1.2.x86_64
ruby-solv-0.7.19-1.2.x86_64
ruby2.7-rubygem-nokogiri-1.12.3-1.1.x86_64
ruby2.7-rubygem-gem2rpm-0.10.1-17.5.x86_64
ruby3.0-rubygem-little-plugger-1.1.4-1.17.x86_64
ruby2.7-rubygem-unf_ext-0.0.7.7-1.8.x86_64
ruby2.7-rubygem-ruby-dbus-0.16.0-1.2.x86_64
ruby2.7-rubygem-ruby-augeas-0.5.0-3.23.x86_64
ruby2.7-rubygem-fast_gettext-2.0.3-1.5.x86_64
ruby2.7-rubygem-abstract_method-1.2.1-2.21.x86_64
ruby-2.7-1.5.x86_64
ruby3.0-rubygem-listen-3.7.0-1.1.x86_64
ruby2.7-rubygem-unf-0.1.4-1.20.x86_64
ruby2.7-rubygem-cfa-1.0.2-1.8.x86_64
ruby2.7-rubygem-cheetah-0.5.2-1.8.x86_64
ruby2.7-rubygem-simpleidn-0.2.1-1.3.x86_64
ruby2.7-rubygem-cfa_grub2-2.0.0-1.8.x86_64
ruby3.0-rubygem-hashicorp-checkpoint-0.1.5-1.12.x86_64
vagrant-vim-2.2.18-1.2.noarch
libstorage-ng-ruby-4.4.36-1.1.x86_64
ruby3.0-rubygem-nokogiri-1.12.3-1.1.x86_64
libvirt-7.7.0-1.1.x86_64
vagrant-bash-completion-2.2.18-1.2.noarch
ruby3.0-3.0.1-3.1.x86_64
libruby2_7-2_7-2.7.3-3.1.x86_64
libvirt-daemon-7.7.0-1.1.x86_64
ruby3.0-rubygem-rubyzip-2.3.2-1.1.x86_64
ruby3.0-rubygem-rubyntlm-0.6.3-1.1.x86_64
ruby3.0-rubygem-rb-fsevent-0.11.0-1.1.x86_64
ruby3.0-rubygem-nori-2.6.0-1.16.x86_64
ruby3.0-rubygem-net-ssh-6.1-6.1.0-1.1.x86_64
ruby3.0-rubygem-multi_json-1.15.0-1.5.x86_64
ruby3.0-rubygem-mime-types-data-3.2021.0704-1.1.x86_64
ruby3.0-rubygem-log4r-1.1.10-9.23.x86_64
ruby3.0-rubygem-httpclient-2.8.3-1.21.x86_64
ruby3.0-rubygem-formatador-0.3.0-1.1.x86_64
ruby3.0-rubygem-ffi-1.15.3-1.1.x86_64
ruby3.0-rubygem-excon-0.85.0-1.1.x86_64
ruby3.0-rubygem-erubi-1.10.0-1.4.x86_64
ruby3.0-rubygem-ed25519-1.2.4-1.15.x86_64
ruby3.0-rubygem-concurrent-ruby-1.1.9-1.1.x86_64
ruby3.0-rubygem-childprocess-4.1.0-1.1.x86_64
ruby3.0-rubygem-builder-3.2.4-1.8.x86_64
ruby3.0-rubygem-bcrypt_pbkdf-1.1.0-1.4.x86_64
ruby3.0-rubygem-ruby-libvirt-0.7.1-2.5.x86_64
ruby3.0-rubygem-net-sftp-3.0.0-1.6.x86_64
ruby3.0-rubygem-net-scp-3.0.0-1.6.x86_64
ruby3.0-rubygem-mime-types-3.3.1-1.8.x86_64
ruby3.0-rubygem-rb-kqueue-0.2.6-1.1.x86_64
ruby3.0-rubygem-rb-inotify-0.10.1-1.8.x86_64
ruby3.0-rubygem-gssapi-1.3.1-1.3.x86_64
ruby3.0-rubygem-vagrant_cloud-3.0.5-1.1.x86_64
ruby3.0-rubygem-i18n-1.8.10-1.1.x86_64
ruby3.0-rubygem-gyoku-1.3.1-1.16.x86_64
ruby3.0-rubygem-fog-core-2.2.4-1.1.x86_64
ruby3.0-rubygem-winrm-2.3.6-1.1.x86_64
ruby3.0-rubygem-fog-xml-0.1.3-1.8.x86_64
ruby3.0-rubygem-winrm-fs-1.3.5-1.5.x86_64
libvirt-libs-7.7.0-1.1.x86_64
ruby3.0-rubygem-fog-libvirt-0.9.0-1.1.x86_64
ruby3.0-rubygem-winrm-elevated-1.2.3-1.3.x86_64
libvirt-daemon-driver-libxl-7.7.0-1.1.x86_64
libvirt-daemon-driver-network-7.7.0-1.1.x86_64
libvirt-daemon-driver-secret-7.7.0-1.1.x86_64
libvirt-daemon-driver-qemu-7.7.0-1.1.x86_64
libvirt-daemon-driver-storage-rbd-7.7.0-1.1.x86_64
libvirt-daemon-config-network-7.7.0-1.1.x86_64
libvirt-daemon-driver-lxc-7.7.0-1.1.x86_64
libvirt-daemon-driver-nodedev-7.7.0-1.1.x86_64
libvirt-daemon-driver-storage-mpath-7.7.0-1.1.x86_64
libvirt-daemon-driver-storage-logical-7.7.0-1.1.x86_64
libvirt-daemon-driver-storage-iscsi-direct-7.7.0-1.1.x86_64
libvirt-daemon-driver-storage-iscsi-7.7.0-1.1.x86_64
libvirt-daemon-driver-storage-disk-7.7.0-1.1.x86_64
libvirt-daemon-config-nwfilter-7.7.0-1.1.x86_64
libvirt-client-7.7.0-1.1.x86_64
libvirt-daemon-driver-storage-7.7.0-1.1.x86_64
libvirt-daemon-driver-nwfilter-7.7.0-1.1.x86_64
libvirt-daemon-driver-storage-scsi-7.7.0-1.1.x86_64
vagrant-libvirt-0.5.3-2.1.x86_64
libvirt-daemon-driver-interface-7.7.0-1.1.x86_64
vagrant-2.2.18-1.2.x86_64
libvirt-daemon-driver-storage-core-7.7.0-1.1.x86_64

Debian testing:

ii  libruby2.7:amd64                     2.7.4-1                                amd64        Libraries necessary to run Ruby 2.7
ii  libsys-virt-perl                     7.0.0-1                                amd64        Perl module providing an extension for the libvirt library
ii  libvirt-clients                      7.0.0-3                                amd64        Programs for the libvirt library
ii  libvirt-daemon                       7.0.0-3                                amd64        Virtualization daemon
ii  libvirt-daemon-config-network        7.0.0-3                                all          Libvirt daemon configuration files (default network)
ii  libvirt-daemon-config-nwfilter       7.0.0-3                                all          Libvirt daemon configuration files (default network filters)
ii  libvirt-daemon-driver-lxc            7.0.0-3                                amd64        Virtualization daemon LXC connection driver
ii  libvirt-daemon-driver-qemu           7.0.0-3                                amd64        Virtualization daemon QEMU connection driver
ii  libvirt-daemon-driver-vbox           7.0.0-3                                amd64        Virtualization daemon VirtualBox connection driver
ii  libvirt-daemon-driver-xen            7.0.0-3                                amd64        Virtualization daemon Xen connection driver
ii  libvirt-daemon-system                7.0.0-3                                amd64        Libvirt daemon configuration files
ii  libvirt-daemon-system-systemd        7.0.0-3                                all          Libvirt daemon configuration files (systemd)
ii  libvirt-dev:amd64                    7.0.0-3                                amd64        development files for the libvirt library
ii  libvirt0:amd64                       7.0.0-3                                amd64        library for interfacing with different virtualization systems
ii  rake                                 13.0.3-1                               all          ruby make-like utility
ii  ruby                                 1:2.7+2                                amd64        Interpreter of object-oriented scripting language Ruby (default version)
ii  ruby-bcrypt-pbkdf                    1.1.0-1                                amd64        Ruby library for OpenBSD's bcrypt_pdkfd
ii  ruby-builder                         3.2.4-1                                all          Builders for XML MarkUps
ii  ruby-childprocess                    4.0.0-1                                all          Ruby library for controlling external programs running in the background
ii  ruby-concurrent                      1.1.6+dfsg-3                           all          modern concurrency tools for Ruby
ii  ruby-domain-name                     0.5.20190701-1                         all          Domain Name manipulation library for Ruby
ii  ruby-ed25519                         1.2.4-2+b1                             amd64        efficient digital signature library providing the Ed25519 algorithm
ii  ruby-erubi                           1.9.0-1                                all          Small ERB Implementation
ii  ruby-erubis                          2.7.0-3.1                              all          fast and extensible eRuby implementation which supports multi-language
ii  ruby-excon                           0.79.0-2                               all          Ruby library for creating fast, persistent, HTTP(S) connections
ii  ruby-ffi                             1.12.2+dfsg-2+b3                       amd64        load dynamic libraries, bind functions from within ruby code
ii  ruby-fog-core                        2.1.0-3.1                              all          shared classes and tests for fog providers and services
ii  ruby-fog-json                        1.2.0-1                                all          JSON parsing for fog providers
ii  ruby-fog-libvirt                     0.8.0-1                                all          Module for the 'fog' gem to support libvirt
ii  ruby-fog-xml                         0.1.1-5.1                              all          XML parsing for fog providers
ii  ruby-formatador                      0.2.5-2.1                              all          Ruby STDOUT text formatting library
ii  ruby-http-accept                     1.7.0-2                                all          Ruby parser for Accept and Accept-Language HTTP headers
ii  ruby-http-cookie                     1.0.3-1                                all          Ruby library to handle HTTP Cookies based on RFC 6265
ii  ruby-i18n                            1.8.8-1                                all          I18n and localization solution for Ruby
ii  ruby-json                            2.3.0+dfsg-1+b3                        amd64        JSON library for Ruby
ii  ruby-libvirt                         0.7.1-1+b3                             amd64        Ruby bindings for libvirt
ii  ruby-listen                          3.2.1-1                                all          Ruby library listening to file modifications
ii  ruby-log4r                           1.1.10-4.1                             all          Logging library for Ruby
ii  ruby-mime-types                      3.3.1-1                                all          guess MIME type of files
ii  ruby-mime-types-data                 3.2015.1120-1.1                        all          registry for information about MIME media type definitions
ii  ruby-minitest                        5.13.0-1                               all          Ruby test tools supporting TDD, BDD, mocking, and benchmarking
ii  ruby-multi-json                      1.14.1-1                               all          Ruby library to provide easy switching between different JSON backends
ii  ruby-net-scp                         3.0.0-1                                all          pure ruby implementation of the SCP protocol
ii  ruby-net-sftp                        1:3.0.0-1                              all          Ruby implementation of the SFTP protocol
ii  ruby-net-ssh                         1:6.1.0-2                              all          Ruby implementation of the SSH protocol
ii  ruby-net-telnet                      0.1.1-2                                all          telnet client library
ii  ruby-netrc                           0.11.0-3                               all          Ruby library to read and write netrc files
ii  ruby-nokogiri                        1.11.1+dfsg-2                          amd64        HTML, XML, SAX, and Reader parser for Ruby
ii  ruby-oj:amd64                        3.11.0-1                               amd64        fast JSON parser and serializer for Ruby
ii  ruby-pkg-config                      1.4.4-1                                all          pkg-config implementation for Ruby
ii  ruby-power-assert                    1.1.7-2                                all          library showing values of variables and method calls in an expression
ii  ruby-rb-inotify                      0.9.10-2                               all          Ruby wrapper for Linux's inotify, using FFI
ii  ruby-rest-client                     2.1.0-1                                all          simple REST client for Ruby
ii  ruby-rubygems                        3.2.5-2                                all          Package management framework for Ruby
ii  ruby-sqlite3                         1.4.2-3                                amd64        SQLite3 interface for Ruby
ii  ruby-test-unit                       3.3.9-1                                all          unit testing framework for Ruby
ii  ruby-unf                             0.1.4-2                                all          Wrapper library to bring Unicode Normalization Form support to Ruby
ii  ruby-unf-ext                         0.0.7.6-1+b3                           amd64        Unicode Normalization Form support library for CRuby
ii  ruby-vagrant-cloud                   3.0.2-1                                all          Vagrant Cloud API Library
ii  ruby-xmlrpc                          0.3.0-2                                all          XMLRPC library for Ruby
ii  ruby-zip                             2.0.0-2                                all          Ruby module for reading and writing zip files
ii  ruby2.7                              2.7.4-1                                amd64        Interpreter of object-oriented scripting language Ruby
ii  rubygems-integration                 1.18                                   all          integration of Debian Ruby packages with Rubygems
ii  vagrant                              2.2.14+dfsg-1                          all          Tool for building and distributing virtualized development environments
ii  vagrant-libvirt                      0.3.0-1                                all          Vagrant plugin that adds an Libvirt provider to Vagrant
Comment 16 Dan Čermák 2021-09-24 08:27:44 UTC
(In reply to Luis Chamberlain from comment #14)
> (In reply to Dan Čermák from comment #12)
> > (In reply to Luis Chamberlain from comment #5)
> > > (In reply to Dan Čermák from comment #3)
> > > > What is the actual issue at hand here? 
> > > 
> > > After a number of reboots using vagrant-libvirt on OpenSUSE Tumbleweed ssh
> > > access times out (over 600 seconds default is the ansible default).
> > > 
> > > > I am honestly puzzled what actually
> > > > the problem is: do your VMs time out to reboot? Do you reboot them via
> > > > vagrant?
> > > 
> > > No, ansible is used to reboot, as noted in the first comment. I also looked
> > > how ansible does this and it is: /sbin/shutdown -r 0
> > > 
> > > I also noted exactly the algorithm used on comment #1 about how it
> > > determines if the reboot was successful or not.
> > > 
> > > >  Can you reproduce this consistently?
> > > 
> > > Yes. Absolutely. At what reboot number? Well it varies, sometimes within 10,
> > > sometimes with 80, sometimes after 600.
> > > 
> > > On debian testing I have no issues and got up to 1,000,000 easy.
> > 
> > Could you export the environment variable `VAGRANT_LOG=DEBUG` during your
> > tests, capture stdout & stderr and upload the last logfile when the box
> > times out?
> 
> There are no vagrant commands run for the reboot test. The provisioning all
> works fine. So do I still set this up when I run the ansible command to run
> the reboot-limit playbook?
 
Ah, ok. So vagrant just performs the `vagrant up` and that's it? Then I honestly don't think that vagrant itself is the culprit here. Also, given James' tests, I would expect that neither is libvirt, which leaves Ansible.

Could you retry your tests with Ansible installed from pip (preferably the same version that you are using on Debian)?

If you are really motivated, you could retry the whole ordeal with the VirtualBox provider.
Comment 17 Luis Chamberlain 2021-09-24 21:07:11 UTC
(In reply to Dan Čermák from comment #16)
> (In reply to Luis Chamberlain from comment #14)
> > (In reply to Dan Čermák from comment #12)
> > > (In reply to Luis Chamberlain from comment #5)
> > > > (In reply to Dan Čermák from comment #3)
> > > > > What is the actual issue at hand here? 
> > > > 
> > > > After a number of reboots using vagrant-libvirt on OpenSUSE Tumbleweed ssh
> > > > access times out (over 600 seconds default is the ansible default).
> > > > 
> > > > > I am honestly puzzled what actually
> > > > > the problem is: do your VMs time out to reboot? Do you reboot them via
> > > > > vagrant?
> > > > 
> > > > No, ansible is used to reboot, as noted in the first comment. I also looked
> > > > how ansible does this and it is: /sbin/shutdown -r 0
> > > > 
> > > > I also noted exactly the algorithm used on comment #1 about how it
> > > > determines if the reboot was successful or not.
> > > > 
> > > > >  Can you reproduce this consistently?
> > > > 
> > > > Yes. Absolutely. At what reboot number? Well it varies, sometimes within 10,
> > > > sometimes with 80, sometimes after 600.
> > > > 
> > > > On debian testing I have no issues and got up to 1,000,000 easy.
> > > 
> > > Could you export the environment variable `VAGRANT_LOG=DEBUG` during your
> > > tests, capture stdout & stderr and upload the last logfile when the box
> > > times out?
> > 
> > There are no vagrant commands run for the reboot test. The provisioning all
> > works fine. So do I still set this up when I run the ansible command to run
> > the reboot-limit playbook?
>  
> Ah, ok. So vagrant just performs the `vagrant up` and that's it? 

Right.

> Then I
> honestly don't think that vagrant itself is the culprit here. Also, given
> James' tests, I would expect that neither is libvirt, which leaves Ansible.

It was not clear to me whether vagrant uses a different dnsmasq or not, but sure, we can try this approach.

> Could you retry your tests with Ansible installed from pip (preferably the
> same version that you are using on Debian)?

This required

zypper remove ansible
pip install ansible==2.10.7

Tests are on their way.
Comment 18 James Fehlig 2021-09-26 15:23:41 UTC
I checked in on my test today:

Got pre-reboot uuid = f39b6cb8-86d5-46a3-8b6d-299630e74999
Rebooting VM...
Reboot scheduled for Sun 2021-09-26 09:17:32 MDT, use 'shutdown -c' to cancel.
waiting for guest to boot...
VM is up!!
Got post-reboot uuid = ae38d87b-8251-4205-8687-e30150185015
Boot UUIDs are different, reboot succeeded!
Iteration: 12292 done

After more than 12k iterations I killed it. Will reboot the host to TW, update it, and start the test again with TW hypervisor and 15 SP3 guest.
Comment 19 Luis Chamberlain 2021-09-27 16:20:57 UTC
(In reply to Luis Chamberlain from comment #17)
> (In reply to Dan Čermák from comment #16)
> > (In reply to Luis Chamberlain from comment #14)
> > > (In reply to Dan Čermák from comment #12)
> > > > (In reply to Luis Chamberlain from comment #5)
> > > > > (In reply to Dan Čermák from comment #3)
> > > > > > What is the actual issue at hand here? 
> > > > > 
> > > > > After a number of reboots using vagrant-libvirt on OpenSUSE Tumbleweed ssh
> > > > > access times out (over 600 seconds default is the ansible default).
> > > > > 
> > > > > > I am honestly puzzled what actually
> > > > > > the problem is: do your VMs time out to reboot? Do you reboot them via
> > > > > > vagrant?
> > > > > 
> > > > > No, ansible is used to reboot, as noted in the first comment. I also looked
> > > > > how ansible does this and it is: /sbin/shutdown -r 0
> > > > > 
> > > > > I also noted exactly the algorithm used on comment #1 about how it
> > > > > determines if the reboot was successful or not.
> > > > > 
> > > > > >  Can you reproduce this consistently?
> > > > > 
> > > > > Yes. Absolutely. At what reboot number? Well it varies, sometimes within 10,
> > > > > sometimes with 80, sometimes after 600.
> > > > > 
> > > > > On debian testing I have no issues and got up to 1,000,000 easy.
> > > > 
> > > > Could you export the environment variable `VAGRANT_LOG=DEBUG` during your
> > > > tests, capture stdout & stderr and upload the last logfile when the box
> > > > times out?
> > > 
> > > There are no vagrant commands run for the reboot test. The provisioning all
> > > works fine. So do I still set this up when I run the ansible command to run
> > > the reboot-limit playbook?
> >  
> > Ah, ok. So vagrant just performs the `vagrant up` and that's it? 
> 
> Right.
> 
> > Then I
> > honestly don't think that vagrant itself is the culprit here. Also, given
> > James' tests, I would expect that neither is libvirt, which leaves Ansible.
> 
> It was not clear to me if vagrant uses a different dnsmasq or  not, but sure
> we can try this approach.
> 
> > Could you retry your tests with Ansible installed from pip (preferably the
> > same version that you are using on Debian)?
> 
> This required
> 
> zypper remove ansible
> pip install ansible==2.10.7
> 
> Tests are on their way.

So far 100 reboots worked with this version of ansible, 2.10.7, and with the post_reboot_delay workaround removed; going to kick off a 10,000-reboot test now.
Comment 20 Luis Chamberlain 2021-09-27 18:40:01 UTC
(In reply to Luis Chamberlain from comment #19)
> So far 100 reboots worked with this version of ansible, 2.10.7, and by
> removing the post_reboot_delay workaround, going to kick off a 10,000 test
> now.

Hit a failure at reboot #19 the second time around, due to the 600-second timeout.

TASK [reboot-limit : Run the reboot test using the ansible reboot module] ******                                                                                             
fatal: [sles15sp3]: FAILED! => {"changed": false, "elapsed": 7305, "msg": "Timed out waiting for last boot time check (timeout=600)", "rebooted": true}
Comment 22 James Fehlig 2021-09-27 20:35:31 UTC
Created attachment 852778 [details]
reboot test

This is my hacky script for a reboot test. I don't think it needs much explanation, but feel free to check it for correctness :-).

I had to tweak it a bit to work on a TW host. Previously I used ping to determine when the guest was up, then waited 10 sec for sshd to come alive in the guest before trying to ssh and extract /proc/sys/kernel/random/boot_id. This worked fine on 15 SP3, but on TW I noticed sshd might not be up within the 10 sec. I adjusted the script to check for an ssh connection when determining whether the guest is up, and restarted the test. Currently at 180 iterations.
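
In Ansible terms, waiting for a working ssh connection rather than a successful ping corresponds to the wait_for_connection module. A hedged sketch (not what the script above uses; the 600-second timeout mirrors the value used elsewhere in this report):

```yaml
# Sketch: succeeds only once the guest's transport (ssh) answers,
# not merely once the network responds to ping.
- name: Wait until the guest's ssh service actually accepts connections
  wait_for_connection:
    timeout: 600
```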
Comment 23 Luis Chamberlain 2021-09-27 22:16:18 UTC
(In reply to James Fehlig from comment #22)
> Created attachment 852778 [details]
> reboot test
> 
> This is my hack script for a reboot test. I don't think it needs much
> explanation, but feel free to check for correctness to test reboot :-).
> 
> I had to tweak it a bit to work on a TW host. Previously I used ping to
> determine when the guest was up, and then waited 10sec for sshd to come
> alive in the guest before trying to ssh and extract
> /proc/sys/kernel/random/boot_id. This worked fine on 15 SP3, but on TW I
> noticed sshd might not be up within the 10sec.

Ansible's default timeout is 600 seconds. It is not clear to me exactly how ansible determines that the timeout has lapsed, though. However, what you are seeing may well be related to the failure that ansible picks up on.

*Why* does it seem that ssh does not come up as quickly on Tumbleweed vs. SLE15-SP3?
Comment 24 James Fehlig 2021-09-28 21:22:56 UTC
(In reply to Luis Chamberlain from comment #23)
> Ansible's default timeout is 600 seconds. It is not clear to me exactly how
> ansible determines if the timeout lapsed though. However, it would seem what
> you are seeing might be related to the failure that ansible picks up on.

Perhaps. Let me word my observations another way: In a SLES15 SP3 guest, the time it takes for sshd to service requests once the network has come to life (responds to ping) is greater on a TW host than a SLES15 SP3 host.

> *Why* does it seem that ssh does not come up as quickly on Tumblweed Vs
> SLE15-SP3?

Not sure. The primary components in play that are different between the two are qemu and/or the host kernel. Maybe some vcpu scheduling or other small performance issue in TW.

That said, I can't confirm 600 seconds was ever exceeded. With the original version of my script that ran on SLES15 SP3 host, I can only confirm the guest's network was up within 600 seconds. The updated version of the script in #22 will bail out if the guest's ssh service does not respond within 600 seconds. BTW, the updated script has now run 2318 iterations on the TW host.
Comment 25 Luis Chamberlain 2021-09-28 22:26:45 UTC
I'm under the impression the issue is ssh command failures. It would seem that, by default, ansible does not retry. Using the following seems to be a bit promising:

diff --git a/ansible.cfg b/ansible.cfg
index b513ff9..4889be0 100644
--- a/ansible.cfg
+++ b/ansible.cfg
@@ -1,2 +1,3 @@
 [defaults]
 display_skipped_hosts = no
+retries = 2000


At least on some hypervisors where I was seeing some really odd failures on other tests (blktests), I am not seeing the issues anymore. I'll be testing this with the reboot-limit workflow as well; however, even if this fixes the issue, it would only be a workaround for a smooth ride. On Debian, for instance, this is not needed.
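To illustrate why retries matter here, a generic retry wrapper looks roughly like this. This is a hypothetical helper showing the idea behind raising the retry count, not what ansible does internally:

```shell
# Hypothetical retry helper: rerun a command up to N times with a short
# pause, succeeding as soon as any attempt does. A transient ssh failure
# then only matters if it repeats N times in a row.
retry() {
    n=$1; shift
    i=0
    while [ "$i" -lt "$n" ]; do
        "$@" && return 0
        i=$((i + 1))
        sleep 1
    done
    return 1
}
```

With no retries, a single dropped connection is fatal, which matches the sporadic failures seen here.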
Comment 26 Luis Chamberlain 2021-09-28 22:29:27 UTC
(In reply to James Fehlig from comment #24)
> (In reply to Luis Chamberlain from comment #23)
> > Ansible's default timeout is 600 seconds. It is not clear to me exactly how
> > ansible determines if the timeout lapsed though. However, it would seem what
> > you are seeing might be related to the failure that ansible picks up on.
> 
> Perhaps. Let me word my observations another way: In a SLES15 SP3 guest, the
> time it takes for sshd to service requests once the network has come to life
> (responds to ping) is greater on a TW host than a SLES15 SP3 host.

OK, it would be nice to know why, but I get it, we have other priorities.

> > *Why* does it seem that ssh does not come up as quickly on Tumblweed Vs
> > SLE15-SP3?
> 
> Not sure. The primary components in play that are different between the two
> are qemu and/or the host kernel. Maybe some vcpu scheduling or other small
> performance issue in TW.
> 
> That said, I can't confirm 600 seconds was ever exceeded.

I'm under the impression certain ssh packets get lost somehow, and since it would seem ansible does not retry by default, this is a fatal condition with default ansible settings.

I have 4 different systems using TW where this issue is present and sporadic, so it is not clear *why* the packets might be lost, but this is my current suspicion.

>  With the original
> version of my script that ran on SLES15 SP3 host, I can only confirm the
> guest's network was up within 600 seconds. The updated version of the script
> in #22 will bail out if the guest's ssh service does not respond within 600
> seconds. BTW, the updated script has now run 2318 iterations on the TW host.

Nice thanks so much for testing this.
Comment 27 Shung-Hsi Yu 2021-09-29 06:32:08 UTC
Randomly throwing my 2¢

(In reply to Luis Chamberlain from comment #26)
> [snip]
> > > *Why* does it seem that ssh does not come up as quickly on Tumblweed Vs
> > > SLE15-SP3?
> > 
> > Not sure. The primary components in play that are different between the two
> > are qemu and/or the host kernel. Maybe some vcpu scheduling or other small
> > performance issue in TW.
> > 
> > That said, I can't confirm 600 seconds was ever exceeded.
> 
> I'm under the impression certain ssh packets get lost somehow, and since it
> would seem ansible does not retry by default, this is fatal condition with
> default ansible settings.

If ansible does not retry by default, then perhaps tuning sysctl net.ipv4.tcp_syn_retries on TW host may help, as it allows the host OS to wait a bit longer during initial TCP connection to guest sshd.

The default seems to be 6 retries, which times out after around 130s[1]; `sudo sysctl -w net.ipv4.tcp_syn_retries=8` should tell the OS to keep waiting for a bit under 600s. (Note: the setting is reset on reboot.)
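Under the standard exponential backoff (1s initial retransmission timeout, doubling each attempt), the total connect timeout for N retries works out to roughly 2^(N+1)-1 seconds. A small sketch to compute it:

```shell
# Approximate total connect timeout for a given net.ipv4.tcp_syn_retries
# value: retransmissions back off exponentially (1s, 2s, 4s, ...), so the
# total is the sum 1 + 2 + ... + 2^retries = 2^(retries+1) - 1 seconds.
syn_timeout() {
    retries=$1
    total=0
    delay=1
    i=0
    while [ "$i" -le "$retries" ]; do
        total=$((total + delay))
        delay=$((delay * 2))
        i=$((i + 1))
    done
    echo "$total"
}
```

`syn_timeout 6` gives 127, matching the ~130s figure from [1], and `syn_timeout 8` gives 511, just under ansible's 600s timeout.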

---

It may also help to capture the network traffic between the host and guest with something like `sudo tcpdump -i $VIRT_INTERFACE -W 3 -G 600 -w dump`, which keeps only the last 30 minutes of packets split across 3 rotating files; interrupt it once the timeout has occurred and upload the files here for analysis.

1: https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/#synsent
Comment 28 James Fehlig 2021-10-05 05:18:01 UTC
I need to use the test machine for other purposes, so I killed my reboot test. It was on iteration 15447. I think it is safe to say that rebooting VMs works fine on both SLE15 SP3 and TW.
Comment 29 Luis Chamberlain 2021-10-12 23:49:24 UTC
(In reply to Shung-Hsi Yu from comment #27)
> Randomly throwing my 2¢
> 
> If Ansible(In reply to Luis Chamberlain from comment #26)
> > [snip]
> > > > *Why* does it seem that ssh does not come up as quickly on Tumblweed Vs
> > > > SLE15-SP3?
> > > 
> > > Not sure. The primary components in play that are different between the two
> > > are qemu and/or the host kernel. Maybe some vcpu scheduling or other small
> > > performance issue in TW.
> > > 
> > > That said, I can't confirm 600 seconds was ever exceeded.
> > 
> > I'm under the impression certain ssh packets get lost somehow, and since it
> > would seem ansible does not retry by default, this is fatal condition with
> > default ansible settings.
> 
> If ansible does not retry by default, then perhaps tuning sysctl
> net.ipv4.tcp_syn_retries on TW host may help, as it allows the host OS to
> wait a bit longer during initial TCP connection to guest sshd.

Thanks for the tip! A rather relevant question here is whether this should really be needed for a libvirt virtual network.

FWIW the default seems to be 6 for both TW and SLES15-SP3.

> The default seems to be 6 retires, which timeouts around 130s[1], `sudo
> sysctl -w net.ipv4.tcp_syn_retries=8` should tell to OS to await for a bit
> less to 600s. (Note: the setting is reset when rebooted)

I will try this if increasing the retries for ansible fails.

> It may also help to capture the network traffic between the host and guest
> with something like `sudo tcpdump -i $VIRT_INTERFACE -W 3 -G 600 -w dump`,
> which only captures packets within last 30 minutes split in 3 files,
> interrupt it when timeout has occurred and upload them here for analysis.

Thanks this can be a next step here for sure!
Comment 30 Luis Chamberlain 2021-10-13 21:22:48 UTC
(In reply to Luis Chamberlain from comment #25)
> diff --git a/ansible.cfg b/ansible.cfg
> index b513ff9..4889be0 100644
> --- a/ansible.cfg
> +++ b/ansible.cfg
> @@ -1,2 +1,3 @@
>  [defaults]
>  display_skipped_hosts = no
> +retries = 2000
> 

This setting only got me to 8 reboots... so going to try the sysctl -w net.ipv4.tcp_syn_retries=8 now...
Comment 31 Luis Chamberlain 2021-10-14 16:24:22 UTC
(In reply to Luis Chamberlain from comment #30)
> (In reply to Luis Chamberlain from comment #25)
> > diff --git a/ansible.cfg b/ansible.cfg
> > index b513ff9..4889be0 100644
> > --- a/ansible.cfg
> > +++ b/ansible.cfg
> > @@ -1,2 +1,3 @@
> >  [defaults]
> >  display_skipped_hosts = no
> > +retries = 2000
> > 
> 
> This setting only got me to 8 reboots... so going to try the sysctl -w
> net.ipv4.tcp_syn_retries=8 now...

This only helped slightly; I got up to 11 reboots. Will try to capture a tcpdump...
Comment 32 Luis Chamberlain 2021-10-14 19:48:09 UTC
The tcpdump is ready; it's 13 MiB, compressed with bzip2. You can find it in my home directory on wotan: bsc1190632.dump.bz2

This was captured during a run of:

make reboot-limit-baseline-kernelci

The command used to capture the dump:

sudo tcpdump -i vnet44 -W 3 -G 600 -w dump

I used vnet44 because:

mcgrof@ives /data1/reboot-limit/sles15sp3/sle-kdevops/kdevops (git::master)$ sudo virsh dumpxml vagrant_sles15sp3 | grep vnet
      <target dev='vnet44'/>
      <target dev='vnet45'/>

And the extra network on the guest should be idle.
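The interface lookup above can be scripted. A hedged sketch that pulls the first vnet target out of `virsh dumpxml` output (in real use the XML would come from `sudo virsh dumpxml vagrant_sles15sp3`):

```shell
# Sketch: extract the first vnet interface name from libvirt domain XML
# fed on stdin, so it can be handed straight to tcpdump -i.
first_vnet() {
    sed -n "s/.*<target dev='\(vnet[0-9]*\)'.*/\1/p" | head -n1
}
```

For example, piping the two `<target dev=.../>` lines above through `first_vnet` yields vnet44, the interface used for the capture.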
Comment 33 Luis Chamberlain 2021-11-12 20:01:46 UTC
Finally, I found an ansible release with which the timeout issues no longer happen, using openSUSE Tumbleweed as the base. The ansible version:

mcgrof@gehrke ~ $ sudo pip show ansible
Name: ansible
Version: 4.6.0
Summary: Radically simple IT automation
Home-page: https://ansible.com/
Author: Ansible, Inc.
Author-email: info@ansible.com
License: GPLv3+
Location: /usr/local/lib/python3.8/site-packages
Requires: ansible-core
Required-by:

mcgrof@gehrke ~ $ sudo pip show ansible-base
Name: ansible-base
Version: 2.10.14
Summary: Radically simple IT automation
Home-page: https://ansible.com/
Author: Ansible, Inc.
Author-email: info@ansible.com
License: GPLv3+
Location: /usr/local/lib/python3.8/site-packages
Requires: PyYAML, cryptography, jinja2, packaging
Required-by:
Comment 34 Dan Čermák 2021-11-15 07:07:33 UTC
Closing this as resolved then.
Comment 35 Luis Chamberlain 2021-11-15 17:16:17 UTC
(In reply to Dan Čermák from comment #34)
> Closing this as resolved then.

Unfortunately the issue is not resolved yet. The reboot limit workflow does not pass on two of the hosts: ives (failed on reboot 2) and huber (failed on reboot 4).
Comment 36 Luis Chamberlain 2021-11-15 17:19:57 UTC
(In reply to Luis Chamberlain from comment #35)
> (In reply to Dan Čermák from comment #34)
> > Closing this as resolved then.
> 
> Unfortunately the issue is not resolved yet. The reboot limit workflow does
> not pass on two of the hosts: ives (failed on reboot 2) and huber (failed on
> reboot 4).

So there are 4 kernel-ci hosts, and below is their status on this test:

 * gehrke: passed 100 loops (so 10,000 reboots)
 * hadar: on reboot loop number 64 (i.e., 6,400 reboots)
 * ives: failed on test loop 2 (200 reboots)
 * huber: failed on test loop 4 (400 reboots)
Comment 37 Luis Chamberlain 2021-11-15 17:20:55 UTC
(In reply to Luis Chamberlain from comment #36)

>  * hadar: on reboot loop number 64 (ie, so 6,400 reboots)

And to be clear, this test is ongoing, so we'll see if it reaches loop 100 successfully.
Comment 38 Luis Chamberlain 2022-05-27 18:27:15 UTC
You may be interested in these findings too:

https://bugzilla.kernel.org/show_bug.cgi?id=216040