Bug 1214983 - slurm segfault with AccountingStorageExternalHost
Summary: slurm segfault with AccountingStorageExternalHost
Status: RESOLVED FIXED
Alias: None
Product: PUBLIC SUSE Linux Enterprise HPC 15 SP5
Classification: openSUSE
Component: Cluster Tool
Version: unspecified
Hardware: Other openSUSE Leap 15.5
Priority: P5 - None; Severity: Normal
Target Milestone: ---
Assignee: Christian Goll
QA Contact: E-mail List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-09-05 09:07 UTC by Michael Curtis
Modified: 2023-10-11 13:09 UTC

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Description Michael Curtis 2023-09-05 09:07:40 UTC
Steps to reproduce
------------------
1. Configure slurmctld with a valid AccountingStorageExternalHost 
2. Start slurmctld.  The daemon segfaults immediately after attempting to connect to the external accounting server
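
To confirm that a given slurm.conf actually sets the option from step 1, a minimal sketch along these lines may help. It assumes the `host[:port]` comma-separated format described in the slurm.conf documentation; the sample host name and port are placeholders, not values from this report, and IPv6 literals are not handled.

```python
import re

# Hypothetical slurm.conf excerpt; host name and port are placeholders.
SAMPLE_CONF = """\
ClusterName=local
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=localhost
AccountingStorageExternalHost=dbd.example.com:6819
"""

def external_hosts(conf_text):
    """Return (host, port) pairs from AccountingStorageExternalHost lines.

    port is None when the entry does not name one.
    """
    hosts = []
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # ignore comments
        m = re.match(r"AccountingStorageExternalHost\s*=\s*(\S+)", line, re.IGNORECASE)
        if not m:
            continue
        for entry in m.group(1).split(","):
            if not entry:
                continue  # skip empty items such as a trailing comma
            host, _, port = entry.partition(":")
            hosts.append((host, int(port) if port else None))
    return hosts

print(external_hosts(SAMPLE_CONF))
```

A non-empty result means slurmctld will try to register with an external slurmdbd at startup, which is the code path that crashes here.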

Actual Results
---------------
slurmctld segfaults.  Logs on the remote host (the server running the "External Host", i.e. slurmdbd) show two inbound connections from the affected slurmctld instance

Expected Results
-----------------
slurmctld connects to the external accounting server and starts successfully.  This mode of operation is required to enable federation between clusters, for example a cloud-based cluster and a local cluster.

Build
-----
openSUSE Leap 15.5

Workarounds
-----------

This is fixed upstream in slurm 23.02.4, and is a regression from 20.11.9 in SP3/SP4 (we did not test this configuration with 22.05.5). We successfully updated the package to build this upstream version and that allowed us to complete the upgrade to 15.5/15 SP5 without further issue (no problems observed in UAT).
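
The version boundaries described above can be written down as a small check. This is only a sketch: the "affected" entry is inferred from the 23.02.2 mention in this report, treating every release from 23.02.4 onward as fixed is an assumption, and the function names are made up for illustration. The input could be, for example, a version string copied from the installed package.

```python
import re

FIXED_IN = (23, 2, 4)       # upstream release this report names as fixed
AFFECTED = {(23, 2, 2)}     # inferred from the report: the crashing release
KNOWN_GOOD = {(20, 11, 9)}  # release the report says worked on SP3/SP4

def parse_version(text):
    """Extract the first X.Y.Z triple from a string such as 'slurm 23.02.4'."""
    m = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if m is None:
        raise ValueError(f"no version found in {text!r}")
    return tuple(int(g) for g in m.groups())

def status(text):
    """Classify a version string using only what this report states."""
    v = parse_version(text)
    if v >= FIXED_IN:
        return "fixed"       # assumption: later releases keep the fix
    if v in AFFECTED:
        return "affected"
    if v in KNOWN_GOOD:
        return "known good"
    return "unknown"         # e.g. 22.05.5, which was not tested

for sample in ("slurm 23.02.4", "23.02.2", "20.11.9", "22.05.5"):
    print(sample, "->", status(sample))
```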

This patched version is here:
 https://build.opensuse.org/package/show/home:mdcurtis:branches:network:cluster/slurm

Alternatively, it seems this upstream commit is intended to solve the problem, and could perhaps be applied to 23.02.2 directly:
 https://github.com/SchedMD/slurm/commit/833ca8dd2121a2c980736c05821608324c7ae97a

However, I haven't tested that latter approach.  It seems the root cause is connecting to the external cluster twice, corrupting the slurmctld internal data structures.
Comment 1 Christian Goll 2023-09-05 11:52:20 UTC
Thanks for your detailed bug report and the link to your fixed repo.

I have updated the package for Factory/Tumbleweed and will push an update to SLE15SP5, which will then also be published for openSUSE Leap 15.5.
Comment 2 OBSbugzilla Bot 2023-09-05 12:25:02 UTC
This is an autogenerated message for OBS integration:
This bug (1214983) was mentioned in
https://build.opensuse.org/request/show/1109029 Factory / slurm
Comment 4 OBSbugzilla Bot 2023-09-06 17:45:14 UTC
This is an autogenerated message for OBS integration:
This bug (1214983) was mentioned in
https://build.opensuse.org/request/show/1109308 Factory / slurm
Comment 7 OBSbugzilla Bot 2023-09-11 10:35:02 UTC
This is an autogenerated message for OBS integration:
This bug (1214983) was mentioned in
https://build.opensuse.org/request/show/1110259 Factory / slurm
Comment 9 Egbert Eich 2023-09-11 16:20:49 UTC
QA-HINT: The AccountingStorageExternalHost feature is quite complex to test, as it requires two fully configured clusters. Furthermore, we are not fully clear about all implications of this feature. We should therefore refrain from trying to create a reproducer.
Comment 10 Maintenance Automation 2023-09-25 16:30:05 UTC
SUSE-RU-2023:3761-1: An update that has one fix can now be installed.

Category: recommended (moderate)
Bug References: 1214983
Sources used:
SUSE Linux Enterprise High Performance Computing 15 SP1 LTSS 15-SP1 (src): slurm_23_02-23.02.4-150100.3.8.1

NOTE: This line indicates an update has been released for the listed product(s). At times this might be only a partial fix. If you have questions please reach out to maintenance coordination.
Comment 11 Maintenance Automation 2023-09-25 16:30:06 UTC
SUSE-RU-2023:3760-1: An update that has one fix can now be installed.

Category: recommended (moderate)
Bug References: 1214983
Sources used:
SUSE Linux Enterprise High Performance Computing 15 SP2 LTSS 15-SP2 (src): slurm_23_02-23.02.4-150200.5.8.1

Comment 12 Maintenance Automation 2023-09-25 16:30:09 UTC
SUSE-RU-2023:3759-1: An update that has one fix can now be installed.

Category: recommended (moderate)
Bug References: 1214983
Sources used:
openSUSE Leap 15.5 (src): slurm-23.02.4-150500.5.6.1
HPC Module 15-SP5 (src): slurm-23.02.4-150500.5.6.1
SUSE Package Hub 15 15-SP5 (src): slurm-23.02.4-150500.5.6.1

Comment 13 Maintenance Automation 2023-09-25 16:30:12 UTC
SUSE-RU-2023:3758-1: An update that has one fix can now be installed.

Category: recommended (moderate)
Bug References: 1214983
Sources used:
openSUSE Leap 15.4 (src): slurm_23_02-23.02.4-150300.7.8.1
HPC Module 15-SP4 (src): slurm_23_02-23.02.4-150300.7.8.1
SUSE Linux Enterprise High Performance Computing ESPOS 15 SP3 (src): slurm_23_02-23.02.4-150300.7.8.1
SUSE Linux Enterprise High Performance Computing LTSS 15 SP3 (src): slurm_23_02-23.02.4-150300.7.8.1

Comment 14 Maintenance Automation 2023-09-28 12:32:22 UTC
SUSE-FU-2023:3860-1: An update that contains one feature and has seven fixes can now be installed.

Category: feature (moderate)
Bug References: 1088693, 1206795, 1208846, 1209216, 1209260, 1212946, 1214983
Jira References: PED-2987
Sources used:
HPC Module 12 (src): pdsh_slurm_20_02-2.34-7.41.2, pdsh-2.34-7.41.2, pdsh_slurm_23_02-2.34-7.41.4, pdsh_slurm_20_11-2.34-7.41.2, pdsh_slurm_18_08-2.34-7.41.2, slurm_23_02-23.02.4-3.7.1, pdsh_slurm_22_05-2.34-7.41.2

Comment 15 Christian Goll 2023-10-11 13:09:06 UTC
Fixed with slurm_23_02