Bug 1212816

Summary: salt-minions on some machines stop responding ("Minion did not return. [No response]") since salt-3006.0-150400.8.34.2
Product: [openSUSE] openSUSE Distribution
Reporter: Oliver Kurz <okurz>
Component: Salt
Assignee: E-Mail List <salt-maintainers>
Status: RESOLVED WORKSFORME
QA Contact: E-mail List <qa-bugs>
Severity: Major
Priority: P2 - High
CC: artem.shiliaev, marius.kittler, okurz, pablo.suarezhernandez, vzhestkov
Version: Leap 15.4
Target Milestone: ---
Hardware: Other
OS: Other
See Also: https://bugzilla.suse.com/show_bug.cgi?id=1213960
          https://bugzilla.suse.com/show_bug.cgi?id=1213257
          https://bugzilla.suse.com/show_bug.cgi?id=1213630
Whiteboard:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
Attachments: Logs on sapworker2

Description Oliver Kurz 2023-06-28 13:11:29 UTC
## Observation

Since the upgrade from salt-3004-150400.8.25.1 to salt-3006.0-150400.8.34.2, the salt-minion on some machines in our salt-managed infrastructure reproducibly stops returning results; salt reports "Minion did not return. [No response]" for them. We currently have 31 machines controlled by salt. Of those 31 machines, 6 reproducibly stop responding after some time, regardless of the salt command used. The other 25 machines are not affected; all have the most up-to-date salt package version installed and an up-to-date Leap 15.4. We run machines with the architectures x86_64, aarch64 and ppc64le and have affected as well as unaffected machines of each architecture, so this cannot be an architecture-specific issue. The issue is visible when executing any salt command such as `test.ping`, and `salt-run manage.down` lists the unresponsive salt minion nodes.

## Steps to reproduce

In our infrastructure we reproduce the problem with:

```
for i in {1..7200}; do echo "### Run $i -- $(date -Is)" && salt --no-color \* test.ping ; df -i / ; salt-run jobs.list_jobs | wc -l && salt --no-color \* saltutil.kill_all_jobs && sleep 60 && rm -rf /var/cache/salt/master/jobs/*; done | tee -a log_salt_test_ping_poo131249_$(date -Is).log
```

showing the problem for the affected machines after roughly one hour.

The command can be simplified by removing counting, logging, cleanup:

```
while :; do salt \* test.ping; done
```

Note that without sleeping between each loop iteration this eventually exhausted free inodes on our salt master node.
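
To see at a glance which minions failed in a given run, the output of the loop can be filtered. The following is a minimal sketch, not part of our actual setup: the function name `unresponsive_minions` is made up for illustration, and it assumes salt's default text output, where each minion ID appears on an unindented line ending in a colon, followed by indented result lines.

```shell
# Sketch: print the IDs of minions that salt reported as not returning.
# Assumption: default text outputter, i.e. an unindented "<minion-id>:" line
# followed by indented result lines for that minion.
unresponsive_minions() {
    awk '
        # Remember the most recent minion ID (strip the trailing colon).
        /^[^[:space:]].*:$/ { id = substr($0, 1, length($0) - 1); next }
        # Report the remembered ID when its result is the "no response" message.
        /Minion did not return/ && id != "" { print id; id = "" }
    '
}

# Usage on the salt master:
#   salt --no-color \* test.ping | unresponsive_minions
```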


## Problem

We could narrow down the problem to the salt package upgrade from salt-3004-150400.8.25.1 to salt-3006.0-150400.8.34.2.
Full changelog of the package for the above mentioned upgrade step:

```
* Mon Jun 19 2023 pablo.suarezhernandez@suse.com
- Make master_tops compatible with Salt 3000 and older minions (bsc#1212516) (bsc#1212517)
- Added:
  * make-master_tops-compatible-with-salt-3000-and-older.patch

* Mon May 29 2023 yeray.gutierrez@suse.com
- Avoid failures due transactional_update module not available in Salt 3006.0 (bsc#1211754)
- Added:
  * define-__virtualname__-for-transactional_update-modu.patch

* Wed May 24 2023 pablo.suarezhernandez@suse.com
- Avoid conflicts with Salt dependencies versions (bsc#1211612)
- Added:
  * avoid-conflicts-with-dependencies-versions-bsc-12116.patch

* Fri May 05 2023 alexander.graul@suse.com
- Update to Salt release version 3006.0 (jsc#PED-4360)
  * See release notes: https://docs.saltproject.io/en/latest/topics/releases/3006.0.html
- Add missing patch after rebase to fix collections Mapping issues
- Add python3-looseversion as new dependency for salt
- Add python3-packaging as new dependency for salt
- Allow entrypoint compatibility for "importlib-metadata>=5.0.0" (bsc#1207071)
- Create new salt-tests subpackage containing Salt tests
- Drop conflictive patch dicarded from upstream
- Fix SLS rendering error when Jinja macros are used
- Fix version detection and avoid building and testing failures
- Prevent deadlocks in salt-ssh executions
- Require python3-jmespath runtime dependency (bsc#1209233)
- Added:
  * 3005.1-implement-zypper-removeptf-573.patch
  * control-the-collection-of-lvm-grains-via-config.patch
  * fix-version-detection-and-avoid-building-and-testing.patch
  * make-sure-the-file-client-is-destroyed-upon-used.patch
  * skip-package-names-without-colon-bsc-1208691-578.patch
  * use-rlock-to-avoid-deadlocks-in-salt-ssh.patch
- Modified:
  * activate-all-beacons-sources-config-pillar-grains.patch
  * add-custom-suse-capabilities-as-grains.patch
  * add-environment-variable-to-know-if-yum-is-invoked-f.patch
  * add-migrated-state-and-gpg-key-management-functions-.patch
  * add-publish_batch-to-clearfuncs-exposed-methods.patch
  * add-salt-ssh-support-with-venv-salt-minion-3004-493.patch
  * add-sleep-on-exception-handling-on-minion-connection.patch
  * add-standalone-configuration-file-for-enabling-packa.patch
  * add-support-for-gpgautoimport-539.patch
  * allow-vendor-change-option-with-zypper.patch
  * async-batch-implementation.patch
  * avoid-excessive-syslogging-by-watchdog-cronjob-58.patch
  * bsc-1176024-fix-file-directory-user-and-group-owners.patch
  * change-the-delimeters-to-prevent-possible-tracebacks.patch
  * debian-info_installed-compatibility-50453.patch
  * dnfnotify-pkgset-plugin-implementation-3002.2-450.patch
  * do-not-load-pip-state-if-there-is-no-3rd-party-depen.patch
  * don-t-use-shell-sbin-nologin-in-requisites.patch
  * drop-serial-from-event.unpack-in-cli.batch_async.patch
  * early-feature-support-config.patch
  * enable-passing-a-unix_socket-for-mysql-returners-bsc.patch
  * enhance-openscap-module-add-xccdf_eval-call-386.patch
  * fix-bsc-1065792.patch
  * fix-for-suse-expanded-support-detection.patch
  * fix-issue-2068-test.patch
  * fix-missing-minion-returns-in-batch-mode-360.patch
  * fix-ownership-of-salt-thin-directory-when-using-the-.patch
  * fix-regression-with-depending-client.ssh-on-psutil-b.patch
  * fix-salt-ssh-opts-poisoning-bsc-1197637-3004-501.patch
  * fix-salt.utils.stringutils.to_str-calls-to-make-it-w.patch
  * fix-the-regression-for-yumnotify-plugin-456.patch
  * fix-traceback.print_exc-calls-for-test_pip_state-432.patch
  * fixes-for-python-3.10-502.patch
  * include-aliases-in-the-fqdns-grains.patch
  * info_installed-works-without-status-attr-now.patch
  * let-salt-ssh-use-platform-python-binary-in-rhel8-191.patch
  * make-aptpkg.list_repos-compatible-on-enabled-disable.patch
  * make-setup.py-script-to-not-require-setuptools-9.1.patch
  * pass-the-context-to-pillar-ext-modules.patch
  * prevent-affection-of-ssh.opts-with-lazyloader-bsc-11.patch
  * prevent-pkg-plugins-errors-on-missing-cookie-path-bs.patch
  * prevent-shell-injection-via-pre_flight_script_args-4.patch
  * read-repo-info-without-using-interpolation-bsc-11356.patch
  * restore-default-behaviour-of-pkg-list-return.patch
  * return-the-expected-powerpc-os-arch-bsc-1117995.patch
  * revert-fixing-a-use-case-when-multiple-inotify-beaco.patch
  * run-salt-api-as-user-salt-bsc-1064520.patch
  * run-salt-master-as-dedicated-salt-user.patch
  * save-log-to-logfile-with-docker.build.patch
  * switch-firewalld-state-to-use-change_interface.patch
  * temporary-fix-extend-the-whitelist-of-allowed-comman.patch
  * update-target-fix-for-salt-ssh-to-process-targets-li.patch
  * use-adler32-algorithm-to-compute-string-checksums.patch
  * use-salt-bundle-in-dockermod.patch
  * x509-fixes-111.patch
  * zypperpkg-ignore-retcode-104-for-search-bsc-1176697-.patch
- Removed:
  * 3003.3-do-not-consider-skipped-targets-as-failed-for.patch
  * 3003.3-postgresql-json-support-in-pillar-423.patch
  * add-amazon-ec2-detection-for-virtual-grains-bsc-1195.patch
  * add-missing-ansible-module-functions-to-whitelist-in.patch
  * add-rpm_vercmp-python-library-for-version-comparison.patch
  * add-support-for-name-pkgs-and-diff_attr-parameters-t.patch
  * adds-explicit-type-cast-for-port.patch
  * align-amazon-ec2-nitro-grains-with-upstream-pr-bsc-1.patch
  * backport-syndic-auth-fixes.patch
  * batch.py-avoid-exception-when-minion-does-not-respon.patch
  * check-if-dpkgnotify-is-executable-bsc-1186674-376.patch
  * clarify-pkg.installed-pkg_verify-documentation.patch
  * detect-module.run-syntax.patch
  * do-not-crash-when-unexpected-cmd-output-at-listing-p.patch
  * enhance-logging-when-inotify-beacon-is-missing-pyino.patch
  * fix-62092-catch-zmq.error.zmqerror-to-set-hwm-for-zm.patch
  * fix-crash-when-calling-manage.not_alive-runners.patch
  * fixes-pkg.version_cmp-on-openeuler-systems-and-a-few.patch
  * fix-exception-in-yumpkg.remove-for-not-installed-pac.patch
  * fix-for-cve-2022-22967-bsc-1200566.patch
  * fix-inspector-module-export-function-bsc-1097531-481.patch
  * fix-ip6_interface-grain-to-not-leak-secondary-ipv4-a.patch
  * fix-issues-with-salt-ssh-s-extra-filerefs.patch
  * fix-jinja2-contextfuntion-base-on-version-bsc-119874.patch
  * fix-multiple-security-issues-bsc-1197417.patch
  * fix-salt-call-event.send-call-with-grains-and-pillar.patch
  * fix-salt.states.file.managed-for-follow_symlinks-tru.patch
  * fix-state.apply-in-test-mode-with-file-state-module-.patch
  * fix-test_ipc-unit-tests.patch
  * fix-the-regression-in-schedule-module-releasded-in-3.patch
  * fix-wrong-test_mod_del_repo_multiline_values-test-af.patch
  * fixes-56144-to-enable-hotadd-profile-support.patch
  * fopen-workaround-bad-buffering-for-binary-mode-563.patch
  * force-zyppnotify-to-prefer-packages.db-than-packages.patch
  * ignore-erros-on-reading-license-files-with-dpkg_lowp.patch
  * ignore-extend-declarations-from-excluded-sls-files.patch
  * ignore-non-utf8-characters-while-reading-files-with-.patch
  * implementation-of-held-unheld-functions-for-state-pk.patch
  * implementation-of-suse_ip-execution-module-bsc-10999.patch
  * improvements-on-ansiblegate-module-354.patch
  * include-stdout-in-error-message-for-zypperpkg-559.patch
  * make-pass-renderer-configurable-other-fixes-532.patch
  * make-sure-saltcacheloader-use-correct-fileclient-519.patch
  * mock-ip_addrs-in-utils-minions.py-unit-test-443.patch
  * normalize-package-names-once-with-pkg.installed-remo.patch
  * notify-beacon-for-debian-ubuntu-systems-347.patch
  * refactor-and-improvements-for-transactional-updates-.patch
  * retry-if-rpm-lock-is-temporarily-unavailable-547.patch
  * set-default-target-for-pip-from-venv_pip_target-envi.patch
  * state.apply-don-t-check-for-cached-pillar-errors.patch
  * state.orchestrate_single-does-not-pass-pillar-none-4.patch
  * support-transactional-systems-microos.patch
  * wipe-notify_socket-from-env-in-cmdmod-bsc-1193357-30.patch
```



## Workaround

Restarting the salt-minion systemd service on affected machines with `systemctl restart salt-minion` mitigates the problem for one to several hours, until the minion becomes unresponsive again.
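
Since an unresponsive minion cannot be restarted through salt itself, the restart has to go over another channel. The following is only a sketch under stated assumptions: the helper names are hypothetical, it assumes `salt-run manage.down` prints one `- <minion-id>` line per unresponsive minion, and it assumes the minion ID is a resolvable host name reachable as root over SSH, which may not hold in other setups.

```shell
# Sketch: turn the list printed by `salt-run manage.down` into plain minion IDs.
# Assumption: the runner prints one "- <minion-id>" line per unresponsive minion.
down_minions() {
    sed -n 's/^[[:space:]]*- //p'
}

# Hypothetical helper: restart salt-minion on every host read from stdin.
# Assumption: minion ID == resolvable host name, root SSH access is available.
restart_down_minions() {
    while read -r host; do
        # -n keeps ssh from consuming the remaining host names on stdin.
        ssh -n "root@${host}" systemctl restart salt-minion
    done
}

# Usage on the salt master:
#   salt-run manage.down | down_minions | restart_down_minions
```

This only automates the mitigation; the minion can still become unresponsive again later.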

For now we have downgraded the affected machines except for one, which we keep purposely broken and can offer, with some limitations applied, to anyone interested in investigating further.

## Further details

Please find our internal investigation issue on
https://progress.opensuse.org/issues/131249
Comment 1 Pablo Suárez Hernández 2023-06-29 09:49:22 UTC
Could we get logs from one of the minions that get stuck?

I had "salt-master" and "salt-minion" running on the same instance, performing "test.ping" every 5 seconds throughout last night, and the minion is still responsive.

It would be nice to have at least the logs at "/var/log/salt/minion" from the stuck minions.

We also had situations in the past where some of the "salt-minion" processes got killed by the OOM killer, leaving the minion stuck. We should check the system journal to see whether the OOM killer was invoked.
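
A possible way to do that check, as a rough sketch: filter the kernel messages for the usual OOM killer lines. The function name `oom_events` is made up here, and the exact kernel message wording can vary between kernel versions, so the patterns are an assumption.

```shell
# Sketch: keep only kernel log lines that point at OOM killer activity.
# Assumption: the kernel logs "... invoked oom-killer ..." and
# "Out of memory: Killed process ..." as current Linux kernels usually do.
oom_events() {
    grep -Ei 'invoked oom-killer|out of memory'
}

# Usage on a stuck minion (journalctl -k prints kernel messages only):
#   journalctl -k --no-pager | oom_events
```

No output would suggest that the OOM killer was not involved in the retained journal period.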

Thanks!
Comment 2 Pablo Suárez Hernández 2023-06-29 09:49:57 UTC
BTW I ran my tests on TW (Tumbleweed), not on Leap, but the Salt version should be the same.
Comment 3 Marius Kittler 2023-07-03 09:12:38 UTC
Created attachment 867935
Logs on sapworker2

Now it also happened on sapworker2 (after it had already happened on sapworker3). Most likely it also happened on sapworker1, but that host was otherwise broken so I couldn't really confirm. I attached logs from sapworker2.

Note that these sap workers are on Leap 15.5.
Comment 4 Oliver Kurz 2023-07-03 11:14:11 UTC
We have all those machines covered in monitoring using a grafana instance on https://monitor.qa.suse.de . We have not observed any OOM condition leading to that during the observed time period.

Please see https://bugzilla.opensuse.org/attachment.cgi?id=867935 for logs.

As Marius Kittler noted, the problem is also reproducible on Leap 15.5 with the corresponding package version, pointing to a clear regression since 150400.8.25.1.
Comment 5 Oliver Kurz 2023-10-04 07:41:20 UTC
We have observed that multiple machines running Leap 15.5 with salt-3005 eventually show the same "No response" problem. A forced install of the Leap 15.4 salt-3004 package on Leap 15.5 seems to work fine.
Comment 6 Victor Zhestkov 2023-10-04 07:59:11 UTC
@Oliver, could you please check with the latest released version of salt (3006.0-150400.8.44.1).
There is a change (check the changelog):
- Revert usage of long running REQ channel to prevent possible
  missing responses on requests and dublicated responses
  (bsc#1213960, bsc#1213630, bsc#1213257)

It should revert some parts of the refactoring causing such behaviour.
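
To check whether a host already has at least that version, something like the following could be used. This is a sketch only: `version_at_least` is a made-up helper, and it relies on GNU `sort -V`, which approximates but is not identical to RPM's own version comparison.

```shell
# Sketch: succeed if the installed version-release ($1) is at least the
# required one ($2). Assumption: GNU `sort -V` ordering is close enough to
# RPM version ordering for these version strings.
version_at_least() {
    # The required version sorts first (or ties) exactly when installed >= required.
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Usage on a minion:
#   installed=$(rpm -q --qf '%{VERSION}-%{RELEASE}\n' salt-minion)
#   version_at_least "$installed" 3006.0-150400.8.44.1 && echo "has the revert"
```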
Comment 7 Oliver Kurz 2023-10-04 08:24:57 UTC
(In reply to Victor Zhestkov from comment #6)
> @Oliver, could you please check with the latest released version of salt
> (3006.0-150400.8.44.1).

Nice, will check. This will take some days to see if the "no response" error does not reappear.
Comment 8 Oliver Kurz 2023-10-11 19:56:34 UTC
I have been running salt-minion on multiple hosts for more than a week now with the version you mentioned as fixed, on both Leap 15.4 and 15.5, and have observed no more problems. So we can consider this bug resolved. Thanks for the support!
Comment 10 Victor Zhestkov 2023-10-11 20:06:35 UTC
Oliver, thank you for the confirmation.

Closing as WORKSFORME, as the described behavior differs from the bugs listed for the fix. The closest one is bsc#1213257; if you prefer to mark this bug as a duplicate, you can use that one as the reference.