Bug 1213538

Summary: openmpi2 doesn't work on Westmere processors
Product: [openSUSE] openSUSE Distribution
Reporter: Yamamoto <n-yamamoto>
Component: Other
Assignee: Nicolas Morey <nicolas.morey>
Status: NEW ---
QA Contact: E-mail List <qa-bugs>
Severity: Normal
Priority: P5 - None
Version: Leap 15.5
Target Milestone: ---
Hardware: x86-64
OS: openSUSE Leap 15.5
Whiteboard:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---

Description Yamamoto 2023-07-21 01:39:03 UTC

Programs compiled with mpic++ from openmpi2-2.1.6-150500.22.3 crash with SIGILL on a Xeon E5620.

Reproducible: Always

Steps to Reproduce:
1. prepare program (foo.cpp)
#include <mpi.h>

int main(int argc, char** argv)
{
MPI_Init(&argc, &argv);
}
2. compile the program
mpic++ -g foo.cpp
3. run a.out
Actual Results:  
noritugu@oetesla001:~/test/mpi> ./a.out
[oetesla001:16557:0:16557] Caught signal 4 (Illegal instruction: illegal operand)



When the program is run under gdb, the output is as follows:

Starting program: /home/noritugu/test/mpi/a.out
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[Detaching after fork from child process 16497]
[New Thread 0x7ffff70e2700 (LWP 16502)]
[New Thread 0x7ffff6751700 (LWP 16503)]
[New Thread 0x7fffef31c700 (LWP 16504)]

Thread 1 "a.out" received signal SIGILL, Illegal instruction.
0x00007fffefb4bb52 in psm3_hfp_sockets_get_port_subnet (unit=1, port=port@entry=1, addr_index=0, subnet=subnet@entry=0x7fffffffd2b0, addr=addr@entry=0x0, idx=idx@entry=0x0, gid=0x0) at prov/psm3/psm3/hal_sockets/sockets_service.c:433
433                     if (subnet) *subnet = psm3_build_ipv4_subnet128(ipv4_addr, ipv4_netmask, ipv4_prefix_len);

The results of disas are as follows:
   0x00007fffefb4bb3d <+1133>:	mov    -0xdc(%rbp),%edx
   0x00007fffefb4bb43 <+1139>:	lea    -0x70(%rbp),%rdi
   0x00007fffefb4bb47 <+1143>:	mov    -0xcc(%rbp),%esi
   0x00007fffefb4bb4d <+1149>:	call   0x7fffefb6a650 <psm3_build_ipv4_subnet128>
=> 0x00007fffefb4bb52 <+1154>:	vmovdqu -0x70(%rbp),%xmm0
   0x00007fffefb4bb57 <+1159>:	vmovups %xmm0,(%rbx)
   0x00007fffefb4bb5b <+1163>:	mov    -0x60(%rbp),%rax
   0x00007fffefb4bb5f <+1167>:	mov    %rax,0x10(%rbx)
   0x00007fffefb4bb63 <+1171>:	mov    -0xa0(%rbp),%rbx
   0x00007fffefb4bb6a <+1178>:	test   %rbx,%rbx


vmovdqu is an AVX instruction, and Westmere processors such as the Xeon E5620 do not support AVX.
Please build the openmpi2 package for architectures without AVX.
Comment 1 Nicolas Morey 2023-07-28 07:49:44 UTC
The issue is in libfabric/PSM3, not in openmpi:
0x00007fffefb4bb52 in psm3_hfp_sockets_get_port_subnet (unit=1, port=port@entry=1, addr_index=0, subnet=subnet@entry=0x7fffffffd2b0, addr=addr@entry=0x0, idx=idx@entry=0x0, gid=0x0) at prov/psm3/psm3/hal_sockets/sockets_service.c:433

However, the PSM3 provider for libfabric does require AVX to work.

For Westmere, you should try using another openmpi transport layer to avoid PSM3.
You can try:
- Disabling libfabric completely by adding --mca btl=^ofi to your mpirun arguments
- Disabling the PSM3 provider for libfabric by setting the environment variable FI_PROVIDER="^psm3"

I found a somewhat similar bug opened for libfabric: https://github.com/ofiwg/libfabric/issues/8933
I've added your info; let's see if upstream can clean this up.