Bugzilla – Bug 1213538
openmpi2 doesn't work on Westmere processors
Last modified: 2023-07-28 07:49:44 UTC
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36 Edg/114.0.1823.82
Build Identifier:

Programs compiled with mpic++ from openmpi2-2.1.6-150500.22.3 crash with SIGILL on a Xeon E5620.

Reproducible: Always

Steps to Reproduce:
1. Prepare a program (foo.cpp):

    #include <mpi.h>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        MPI_Finalize();
    }

2. Compile the program:

    mpic++ -g foo.cpp

3. Run a.out.

Actual Results:

    noritugu@oetesla001:~/test/mpi> ./a.out
    [oetesla001:16557:0:16557] Caught signal 4 (Illegal instruction: illegal operand)

When the program is run under gdb, the output is as follows:

    Starting program: /home/noritugu/test/mpi/a.out
    [Thread debugging using libthread_db enabled]
    Using host libthread_db library "/lib64/libthread_db.so.1".
    [Detaching after fork from child process 16497]
    [New Thread 0x7ffff70e2700 (LWP 16502)]
    [New Thread 0x7ffff6751700 (LWP 16503)]
    [New Thread 0x7fffef31c700 (LWP 16504)]

    Thread 1 "a.out" received signal SIGILL, Illegal instruction.
    0x00007fffefb4bb52 in psm3_hfp_sockets_get_port_subnet (unit=1, port=port@entry=1, addr_index=0, subnet=subnet@entry=0x7fffffffd2b0, addr=addr@entry=0x0, idx=idx@entry=0x0, gid=0x0) at prov/psm3/psm3/hal_sockets/sockets_service.c:433
    433         if (subnet) *subnet = psm3_build_ipv4_subnet128(ipv4_addr, ipv4_netmask, ipv4_prefix_len);

The output of disas around the faulting address is:

       0x00007fffefb4bb3d <+1133>:  mov    -0xdc(%rbp),%edx
       0x00007fffefb4bb43 <+1139>:  lea    -0x70(%rbp),%rdi
       0x00007fffefb4bb47 <+1143>:  mov    -0xcc(%rbp),%esi
       0x00007fffefb4bb4d <+1149>:  call   0x7fffefb6a650 <psm3_build_ipv4_subnet128>
    => 0x00007fffefb4bb52 <+1154>:  vmovdqu -0x70(%rbp),%xmm0
       0x00007fffefb4bb57 <+1159>:  vmovups %xmm0,(%rbx)
       0x00007fffefb4bb5b <+1163>:  mov    -0x60(%rbp),%rax
       0x00007fffefb4bb5f <+1167>:  mov    %rax,0x10(%rbx)
       0x00007fffefb4bb63 <+1171>:  mov    -0xa0(%rbp),%rbx
       0x00007fffefb4bb6a <+1178>:  test   %rbx,%rbx

vmovdqu is an AVX instruction, and the Westmere microarchitecture (Xeon E5620) does not support AVX.
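Since the crash comes down to whether the CPU implements AVX, it can help to confirm that on the affected node before looking at the MPI stack. Below is a minimal sketch, assuming GCC or Clang on x86 (it relies on the compiler builtins `__builtin_cpu_init` and `__builtin_cpu_supports`). The helper names `cpu_has_avx` and `avx_diagnosis` are hypothetical and are not part of Open MPI or libfabric.

```cpp
#include <string>

// Hypothetical helper: returns true if the CPU we are running on reports AVX.
// __builtin_cpu_supports is a GCC/Clang builtin available on x86 targets only;
// on other architectures this sketch simply reports "no AVX".
bool cpu_has_avx() {
#if defined(__x86_64__) || defined(__i386__)
    // Initializes the CPU feature data. Strictly required only when called
    // before static constructors have run, but harmless to call again.
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx") != 0;
#else
    return false;
#endif
}

// Hypothetical helper: a one-line diagnosis suitable for a log or bug report.
std::string avx_diagnosis() {
    if (cpu_has_avx()) {
        return "AVX supported";
    }
    return "AVX not supported: AVX instructions such as vmovdqu will raise SIGILL";
}
```

On the Xeon E5620 from this report, `cpu_has_avx()` would be expected to return false, because AVX first appeared in Sandy Bridge, the generation after Westmere. From a shell, `grep -w avx /proc/cpuinfo` gives the same answer without compiling anything.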
Please build the openmpi2 package for architectures without AVX.
The issue is in libfabric/PSM3, not in Open MPI:

    0x00007fffefb4bb52 in psm3_hfp_sockets_get_port_subnet (unit=1, port=port@entry=1, addr_index=0, subnet=subnet@entry=0x7fffffffd2b0, addr=addr@entry=0x0, idx=idx@entry=0x0, gid=0x0) at prov/psm3/psm3/hal_sockets/sockets_service.c:433

However, the PSM3 provider for libfabric does require AVX to work. On Westmere, you should try another Open MPI transport layer so that PSM3 is not used. You can try:
- Disabling libfabric completely by adding "--mca btl ^ofi" to the mpirun arguments
- Disabling only the PSM3 provider for libfabric by setting the environment variable FI_PROVIDER="^psm3"

I found a somewhat similar bug opened for libfabric: https://github.com/ofiwg/libfabric/issues/8933
I've added your info there; let's see if upstream can clean this up.
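For programs that are started directly (as ./a.out is in the report) rather than through a wrapper script, the FI_PROVIDER workaround could also be applied from inside the program. The following is only a sketch under an assumption not confirmed in this report: that libfabric reads FI_PROVIDER when it is first initialized during MPI_Init, so the variable must be set before that call. The function `exclude_psm3_without_avx` is hypothetical, not an Open MPI or libfabric API.

```cpp
#include <cstdlib>

// Hypothetical helper (not part of Open MPI or libfabric).
// If the CPU lacks AVX, exclude the PSM3 provider by setting
// FI_PROVIDER="^psm3". Must be called before MPI_Init, under the assumption
// that libfabric reads FI_PROVIDER when it initializes inside MPI_Init.
// Returns true if FI_PROVIDER is now set (by us or already by the user),
// false if the CPU has AVX and the environment was left untouched.
bool exclude_psm3_without_avx(bool cpu_has_avx) {
    if (cpu_has_avx) {
        return false;  // PSM3 can run; do not change the environment
    }
    // POSIX setenv: overwrite=0 keeps any FI_PROVIDER value the user set
    // explicitly. Returns 0 on success, including when the variable exists.
    return setenv("FI_PROVIDER", "^psm3", /*overwrite=*/0) == 0;
}
```

In foo.cpp this would be the first statement of main(), before MPI_Init, with the flag coming from a CPU feature check such as GCC's `__builtin_cpu_supports("avx")`. Passing overwrite=0 is a deliberate choice so that an explicit FI_PROVIDER setting from the user always wins; exporting FI_PROVIDER="^psm3" in the job environment, as suggested above, remains the simpler option.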