|
Bugzilla – Full Text Bug Listing |
| Summary: | openmpi2 doesn't work on Westmere processors | ||
|---|---|---|---|
| Product: | [openSUSE] openSUSE Distribution | Reporter: | Yamamoto <n-yamamoto> |
| Component: | Other | Assignee: | Nicolas Morey <nicolas.morey> |
| Status: | NEW --- | QA Contact: | E-mail List <qa-bugs> |
| Severity: | Normal | ||
| Priority: | P5 - None | ||
| Version: | Leap 15.5 | ||
| Target Milestone: | --- | ||
| Hardware: | x86-64 | ||
| OS: | openSUSE Leap 15.5 | ||
| Whiteboard: | |||
| Found By: | --- | Services Priority: | |
| Business Priority: | Blocker: | --- | |
| Marketing QA Status: | --- | IT Deployment: | --- |
The issue is in libfabric/PSM3, not in openmpi:

0x00007fffefb4bb52 in psm3_hfp_sockets_get_port_subnet (unit=1, port=port@entry=1, addr_index=0, subnet=subnet@entry=0x7fffffffd2b0, addr=addr@entry=0x0, idx=idx@entry=0x0, gid=0x0) at prov/psm3/psm3/hal_sockets/sockets_service.c:433

However, the PSM3 provider for libfabric does require AVX to work. For Westmere, you should try another openmpi transport layer to avoid using PSM3. You can try:
- Disabling libfabric completely by adding --mca btl=^ofi to your mpirun arguments
- Disabling the PSM3 provider for libfabric by setting the env var FI_PROVIDER="^psm3"

I found a somewhat similar bug opened for libfabric: https://github.com/ofiwg/libfabric/issues/8933
I've added your info, let's see if upstream can clean this up.
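The second workaround can be sketched as follows. This is a minimal example, assuming a POSIX shell and that the a.out reproducer from the original report sits in the current directory; the existence check around ./a.out is an addition so the snippet does nothing harmful when run elsewhere.

```shell
# Workaround sketch: ask libfabric not to use the PSM3 provider.
# FI_PROVIDER="^psm3" is the setting suggested in the comment above;
# the leading "^" excludes the named provider instead of selecting it.
export FI_PROVIDER="^psm3"

# Run the reproducer only if it is present (assumption: binary built as ./a.out).
if [ -x ./a.out ]; then
    ./a.out
fi
```

Exporting the variable affects every MPI program started from the same shell session, so it may be preferable to set it only on the command line of the affected job.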
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36 Edg/114.0.1823.82
Build Identifier:

Programs compiled with mpic++ from openmpi2-2.1.6-150500.22.3 crash with SIGILL on a Xeon E5620.

Reproducible: Always

Steps to Reproduce:
1. Prepare the program (foo.cpp):
   #include <mpi.h>
   int main(int argc, char** argv) {
     MPI_Init(&argc, &argv);
   }
2. Compile the program:
   mpic++ -g foo.cpp
3. Run a.out

Actual Results:
noritugu@oetesla001:~/test/mpi> ./a.out
[oetesla001:16557:0:16557] Caught signal 4 (Illegal instruction: illegal operand)

When the program is run under gdb, the output is as follows:
Starting program: /home/noritugu/test/mpi/a.out
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[Detaching after fork from child process 16497]
[New Thread 0x7ffff70e2700 (LWP 16502)]
[New Thread 0x7ffff6751700 (LWP 16503)]
[New Thread 0x7fffef31c700 (LWP 16504)]
Thread 1 "a.out" received signal SIGILL, Illegal instruction.
0x00007fffefb4bb52 in psm3_hfp_sockets_get_port_subnet (unit=1, port=port@entry=1, addr_index=0, subnet=subnet@entry=0x7fffffffd2b0, addr=addr@entry=0x0, idx=idx@entry=0x0, gid=0x0) at prov/psm3/psm3/hal_sockets/sockets_service.c:433
433 if (subnet) *subnet = psm3_build_ipv4_subnet128(ipv4_addr, ipv4_netmask, ipv4_prefix_len);

The output of disas is as follows:
   0x00007fffefb4bb3d <+1133>: mov -0xdc(%rbp),%edx
   0x00007fffefb4bb43 <+1139>: lea -0x70(%rbp),%rdi
   0x00007fffefb4bb47 <+1143>: mov -0xcc(%rbp),%esi
   0x00007fffefb4bb4d <+1149>: call 0x7fffefb6a650 <psm3_build_ipv4_subnet128>
=> 0x00007fffefb4bb52 <+1154>: vmovdqu -0x70(%rbp),%xmm0
   0x00007fffefb4bb57 <+1159>: vmovups %xmm0,(%rbx)
   0x00007fffefb4bb5b <+1163>: mov -0x60(%rbp),%rax
   0x00007fffefb4bb5f <+1167>: mov %rax,0x10(%rbx)
   0x00007fffefb4bb63 <+1171>: mov -0xa0(%rbp),%rbx
   0x00007fffefb4bb6a <+1178>: test %rbx,%rbx

vmovdqu is an AVX instruction.
Please build the openmpi2 package for architectures without AVX.
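To confirm on a given machine that the CPU really lacks AVX, the flags in /proc/cpuinfo can be checked. The following is a minimal sketch, assuming Linux on x86-64; has_cpu_flag is a hypothetical helper name, not part of any package mentioned in this report.

```shell
# Return success if the given CPU feature flag appears in a cpuinfo-style file.
# has_cpu_flag is a hypothetical helper; the file argument defaults to /proc/cpuinfo.
has_cpu_flag() {
    flag=$1
    cpuinfo=${2:-/proc/cpuinfo}
    # Take the first "flags" line and match the flag as a whole word,
    # so that "avx" does not match "avx2" or "avx512f".
    grep -m1 '^flags' "$cpuinfo" 2>/dev/null | grep -qw -- "$flag"
}

if has_cpu_flag avx; then
    echo "AVX supported"
else
    echo "AVX not reported: AVX instructions such as vmovdqu will raise SIGILL"
fi
```

On a CPU without AVX, such as the Xeon E5620 from the report, the second message is expected.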