VASP
Build configuration (MKL)
On Elysium you can build VASP with Spack using the following specification (the tarball with the source code must be located in the directory where the command is executed):
spack install vasp@6.4.3 +openmp +fftlib ^openmpi@5.0.5 ^fftw@3+openmp ^intel-oneapi-mkl threads=openmp +ilp64
This build uses:
- Intel MKL (ILP64) for BLAS, LAPACK, and ScaLAPACK,
- FFTLIB (VASP’s internal FFT library) to avoid MKL-CDFT issues on AMD,
- OpenMPI 5.0.5 as MPI implementation,
- OpenMP threading enabled.
We chose this configuration because MKL is the most widely used math library in HPC and, combined with libfakeintel (see below), provides high performance on AMD CPUs as well.
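If you want to verify what Spack will build before starting the installation, you can print the concretized spec first. The following is an optional sketch using standard Spack commands with the same spec string as above:
# Print the concretized dependency tree (with install status) without building anything:
spack spec -I vasp@6.4.3 +openmp +fftlib ^openmpi@5.0.5 ^fftw@3+openmp ^intel-oneapi-mkl threads=openmp +ilp64
# After installation, list the installed VASP variants and their build options:
spack find -v vasp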
libfakeintel
Intel’s MKL library enables optimized AVX512 code paths only on Intel CPUs. On AMD processors, MKL by default falls back to AVX2 or SSE code paths.
To enable the AVX512 code paths on AMD CPUs as well, we provide libfakeintel system-wide. Users can activate it by preloading the library in their job scripts:
export LD_PRELOAD=/lib64/libfakeintel.so:$LD_PRELOAD
In the results table below, the column MKL+AVX512 refers to runs with libfakeintel preloaded.
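To check that the preload actually takes effect, one option is MKL's verbose mode: its first MKL_VERBOSE output line reports, among other things, the instruction-set extension MKL has selected (look for AVX-512 rather than AVX2). A minimal sketch for a short interactive test, assuming the vasp-mkl module from the jobscripts below is loaded:
export LD_PRELOAD=/lib64/libfakeintel.so:$LD_PRELOAD
export MKL_VERBOSE=1
# The first MKL_VERBOSE line of a short test run reports the selected code path.
mpirun -np 1 vasp_std | grep -m 1 MKL_VERBOSE
# Unset MKL_VERBOSE again for production runs; it writes one line per MKL call.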
Build configuration (AOCL)
We also tested a build using AMD’s AOCC compiler and AOCL libraries:
spack install vasp@6.4.3 +openmp +fftlib %aocc ^amdfftw@5 ^amdblis@5 threads=openmp ^amdlibflame@5 ^amdscalapack@5 ^openmpi
This build replaces Intel MKL with AMD’s AOCL stack (BLIS, libFLAME, ScaLAPACK, FFTW). AOCL does not require libfakeintel since its dispatch mechanism already enables optimal code paths on AMD CPUs.
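To confirm which math stack a given VASP binary is actually linked against (AOCL vs. MKL), you can inspect its dynamic library dependencies. A minimal sketch, assuming the libraries are linked dynamically and the corresponding module puts vasp_std on the PATH:
module load vasp-aocl
# List the resolved shared libraries and filter for the math libraries.
ldd $(which vasp_std) | grep -Ei 'blis|flame|fftw|scalapack|mkl'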
Parallelization strategies and NUMA domains
Each compute node on Elysium has two CPUs (sockets) with 24 cores each. A CPU socket is further subdivided into NUMA domains that group cores sharing the same memory controller.
We tested two hybrid parallelization layouts:
- 2 MPI ranks × 24 threads each: one rank bound to each CPU socket. Each rank stays within its local socket, but its 24 threads span all NUMA domains of that socket, so some threads access memory attached to a different domain.
- 8 MPI ranks × 6 threads each: one rank bound to each L3 cache/NUMA domain (6 cores). This increases memory locality, because each rank mainly uses its own local memory region instead of accessing data from another domain. In principle this reduces memory traffic and improves performance.
In addition we tested the MPI-only mode with one MPI rank per core (48 ranks per node).
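The socket, core, and NUMA layout of a compute node can be inspected directly on the node; a short sketch using standard Linux tools (the exact NUMA subdivision depends on the BIOS configuration of the machine):
# Sockets, cores per socket, and NUMA domains as seen by the operating system.
lscpu | grep -E 'Socket|Core|NUMA'
# CPUs and memory attached to each NUMA domain.
numactl --hardware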
Single-node job (1 node, 48 cores)
Baseline: MKL MPI-only (2362 s = 39:22).
Configuration | MKL [s] | Speedup | MKL+AVX512 [s] | Speedup | AOCL [s] | Speedup | Fastest Variant
---|---|---|---|---|---|---|---
MPI-only | 2362 | 1.00× | 2013 | 1.17× | 1923 | 1.23× | AOCL
Hybrid Socket (2×24) | 5065 | 0.47× | 4256 | 0.55× | 6971 | 0.34× | MKL+AVX512
Hybrid L3 (8×6) | 2243 | 1.05× | 1933 | 1.22× | 1880 | 1.26× | AOCL
Multi-node results (2 nodes, 96 cores)
In a 2-node run (96 cores), the single-node trend reverses: MPI-only becomes the fastest choice, slightly favoring AOCL over MKL+AVX512, while hybrid L3 underperforms for both stacks.
Baseline: MKL+AVX512 MPI-only (1308 s = 21:48).
Configuration | MKL+AVX512 [s] | Speedup | AOCL [s] | Speedup | Fastest Variant
---|---|---|---|---|---
MPI-only | 1308 (21:48) | 1.00× | 1294 (21:34) | 1.01× | AOCL
Hybrid L3 (8×6 per node) | 1356 (22:36) | 0.96× | 1696 (28:16) | 0.77× | MKL+AVX512
Why hybrid tends to underperform on multiple nodes
Across nodes, collective communication weighs more heavily. MPI-only keeps a higher rank count, which yields smaller per-rank messages. Hybrid parallelization reduces the rank count and shifts work into OpenMP regions, which increases the per-rank message sizes in collectives over the network.
Recommendation
Note that these recommendations are based on a few selected inputs. Your own inputs may behave differently and should be benchmarked to find the most performant setup.
- Single-node (1×48 cores):
  - AOCL Hybrid L3 is currently the fastest (~3% ahead of MKL+AVX512 Hybrid L3).
  - AOCL MPI-only is also strong and MKL gains noticeably from libfakeintel.
  - Socket-hybrid is not recommended.
- Multi-node (2×48 cores = 96 cores):
  - MPI-only performs best in our test.
  - AOCL MPI-only edges MKL+AVX512 MPI-only slightly.
  - Hybrid L3 falls behind on both stacks (particularly AOCL).
Practical guidance
- Start with AOCL Hybrid L3 on single-node jobs.
- Start with MPI-only on multi-node jobs (≥2 nodes).
- Always benchmark your own inputs.
Jobscripts
Example Jobscript for AOCL build
#!/bin/bash
#SBATCH -J vasp_aocl_hyb_l3_8x6
#SBATCH -N 1
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=6
#SBATCH -p cpu
#SBATCH -t 48:00:00
#SBATCH --exclusive
module purge
module load vasp-aocl
export OMP_NUM_THREADS=6
export OMP_PLACES=cores
export OMP_PROC_BIND=close
export BLIS_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export FFTW_NUM_THREADS=1
mpirun -np 8 --bind-to l3cache --report-bindings vasp_std
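Assuming the script above is saved as, for example, vasp_aocl.sh (the file name is only an illustration) in the directory that contains the VASP input files (INCAR, POSCAR, KPOINTS, POTCAR), it can be submitted and monitored as follows; the same applies to the MKL jobscript below:
sbatch vasp_aocl.sh
squeue -u $USER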
Example Jobscript for MKL build
#!/bin/bash
#SBATCH -J vasp_mkl_hyb_l3_8x6
#SBATCH -N 1
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=6
#SBATCH -p cpu
#SBATCH -t 48:00:00
#SBATCH --exclusive
module purge
module load vasp-mkl
export LD_PRELOAD=/lib64/libfakeintel.so:${LD_PRELOAD}
export OMP_NUM_THREADS=6
export OMP_PLACES=cores
export OMP_PROC_BIND=close
export MKL_NUM_THREADS=1
export MKL_DYNAMIC=FALSE
mpirun -np 8 --bind-to l3cache --report-bindings vasp_std
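Example Jobscript for a multi-node MPI-only run
For jobs on two or more nodes, the results above favor MPI-only. The following is a minimal sketch of a 2-node MPI-only jobscript (96 ranks, one per core); it assumes the vasp-aocl module from above (for the MKL build, load vasp-mkl and add the LD_PRELOAD line instead) and should be adapted to your own case:
#!/bin/bash
#SBATCH -J vasp_aocl_mpi_only_2n
#SBATCH -N 2
#SBATCH --ntasks-per-node=48
#SBATCH --cpus-per-task=1
#SBATCH -p cpu
#SBATCH -t 48:00:00
#SBATCH --exclusive
module purge
module load vasp-aocl
# Pure MPI: one rank per core, no OpenMP threading.
export OMP_NUM_THREADS=1
mpirun -np 96 --bind-to core --report-bindings vasp_std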