VASP

Build configuration (MKL)

On Elysium, you can build VASP with Spack using the following specification (the tarball with the VASP source code must be located in the directory where the command is executed):

spack install vasp@6.4.3 +openmp +fftlib ^openmpi@5.0.5 ^fftw@3+openmp ^intel-oneapi-mkl threads=openmp +ilp64

This build uses:

  • Intel MKL (ILP64) for BLAS, LAPACK, and ScaLAPACK,
  • FFTLIB (VASP’s internal FFT library) to avoid MKL-CDFT issues on AMD,
  • OpenMPI 5.0.5 as the MPI implementation,
  • OpenMP threading enabled.

We chose this configuration because MKL is the most widely used math library in HPC and, combined with libfakeintel (see the next section), also provides high performance on AMD CPUs.
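
Before installing, you can let Spack show how it will concretize this specification. This is just a quick check; the exact output depends on the Spack version and the site configuration:

spack spec -Il vasp@6.4.3 +openmp +fftlib ^openmpi@5.0.5 ^fftw@3+openmp ^intel-oneapi-mkl threads=openmp +ilp64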


libfakeintel

Intel’s MKL library enables optimized AVX512 code paths only on Intel CPUs. On AMD processors, MKL by default falls back to AVX2 or SSE code paths.

To enable the AVX512 code paths on AMD CPUs as well, we provide libfakeintel system-wide. Users can activate it by preloading the library in their job scripts:

export LD_PRELOAD=/lib64/libfakeintel.so:$LD_PRELOAD

In the results table below, the column MKL+AVX512 refers to runs with libfakeintel preloaded.
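
If you reuse job scripts on systems where the library might not be installed, a slightly more defensive sketch only sets the preload when the file exists and avoids a dangling colon if LD_PRELOAD was previously unset:

FAKEINTEL=/lib64/libfakeintel.so
if [ -e "$FAKEINTEL" ]; then
    # prepend the override; append the old LD_PRELOAD only if it was already set
    export LD_PRELOAD=${FAKEINTEL}${LD_PRELOAD:+:$LD_PRELOAD}
fi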


Build configuration (AOCL)

We also tested a build using AMD’s AOCC compiler and AOCL libraries:

spack install vasp@6.4.3 +openmp +fftlib %aocc  ^amdfftw@5 ^amdblis@5 threads=openmp ^amdlibflame@5 ^amdscalapack@5 ^openmpi

This build replaces Intel MKL with AMD’s AOCL stack (BLIS, libFLAME, ScaLAPACK, FFTW). AOCL does not require libfakeintel since its dispatch mechanism already enables optimal code paths on AMD CPUs.
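
To check which math libraries a given VASP binary actually links against, you can inspect it with ldd after loading the corresponding module. This is a sketch; the module names match the job scripts further down, and the exact library file names depend on the installed AOCL or MKL versions:

module load vasp-aocl   # or vasp-mkl for the MKL build
ldd "$(which vasp_std)" | grep -Ei 'blis|flame|scalapack|fftw|mkl'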


Parallelization strategies and NUMA domains

Each compute node on Elysium has two CPUs (sockets) with 24 cores each. A CPU socket is further subdivided into NUMA domains that group cores sharing the same memory controller.

We tested two hybrid parallelization layouts:

  • 2 MPI ranks × 24 threads each: one rank bound to each CPU socket. Each rank stays within its local socket, but its 24 threads are spread across all of the socket's L3 cache/NUMA domains, so their memory accesses are not confined to a single domain.
  • 8 MPI ranks × 6 threads each: one rank bound to each L3 cache/NUMA domain (6 cores). This increases memory locality, because each rank mainly uses its own local memory region instead of accessing data from another domain, which in principle reduces memory traffic and improves performance.

In addition, we tested the MPI-only mode with one MPI rank per core (48 ranks per node).
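
To see how sockets, NUMA domains, and L3 cache domains are laid out on a compute node, you can inspect the topology yourself, for example in an interactive job. This is only a sketch using standard Linux tools; hwloc-ls is only available if hwloc is installed or provided as a module:

lscpu | grep -E 'Socket|NUMA|L3'   # sockets, NUMA nodes, L3 cache
numactl --hardware                 # NUMA domains and their memory
hwloc-ls --no-io                   # full topology tree including L3 cache groups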


Single-node results (1 node, 48 cores)

Baseline: MKL MPI-only (2362 s = 39:22).

Configuration        | MKL [s] | Speedup | MKL+AVX512 [s] | Speedup | AOCL [s] | Speedup | Fastest Variant
MPI-only             | 2362    | 1.00×   | 2013           | 1.17×   | 1923     | 1.23×   | AOCL
Hybrid Socket (2×24) | 5065    | 0.47×   | 4256           | 0.55×   | 6971     | 0.34×   | MKL+AVX512
Hybrid L3 (8×6)      | 2243    | 1.05×   | 1933           | 1.22×   | 1880     | 1.26×   | AOCL

Multi-node results (2 nodes, 96 cores)

In a 2-node run (96 cores), the single-node trend reverses: MPI-only becomes the fastest choice, slightly favoring AOCL over MKL+AVX512, while hybrid L3 underperforms for both stacks.

Baseline: MKL+AVX512 MPI-only (1308 s = 21:48).

Configuration            | MKL+AVX512 [s] | Speedup | AOCL [s]     | Speedup | Fastest Variant
MPI-only                 | 1308 (21:48)   | 1.00×   | 1294 (21:34) | 1.01×   | AOCL
Hybrid L3 (8×6 per node) | 1356 (22:36)   | 0.96×   | 1696 (28:16) | 0.77×   | MKL+AVX512

Why hybrid tends to underperform on multiple nodes

Across nodes, collective communication weighs more heavily. MPI-only keeps a higher rank count, which yields smaller per-rank messages in collectives. Hybrid runs reduce the rank count and shift work into OpenMP regions, which increases the per-rank message sizes that have to cross the network in collectives.


Recommendation

Note that these recommendations are based on a few selected inputs. Your own inputs may behave differently, so benchmark them to find the best-performing setup for your case.

  • Single-node (1×48 cores):

    • AOCL Hybrid L3 is currently the fastest (~3% ahead of MKL+AVX512 Hybrid L3).
    • AOCL MPI-only is also strong, and MKL gains noticeably from libfakeintel.
    • Socket-hybrid is not recommended.
  • Multi-node (2×48 cores = 96 cores):

    • MPI-only performs best in our test.
    • AOCL MPI-only edges MKL+AVX512 MPI-only slightly.
    • Hybrid L3 falls behind on both stacks (particularly AOCL).

Practical guidance

  • Start with AOCL Hybrid L3 on single-node jobs.
  • Start with MPI-only on multi-node jobs (≥2 nodes).
  • Always benchmark your own inputs.

Jobscripts

Example jobscript for the AOCL build

#!/bin/bash
#SBATCH -J vasp_aocl_hyb_l3_8x6
#SBATCH -N 1
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=6
#SBATCH -p cpu
#SBATCH -t 48:00:00
#SBATCH --exclusive

module purge
module load vasp-aocl

# 6 OpenMP threads per rank, pinned to the cores of the rank's L3 domain
export OMP_NUM_THREADS=6
export OMP_PLACES=cores
export OMP_PROC_BIND=close

# keep the math/FFT libraries single-threaded; OpenMP parallelism comes from VASP itself
export BLIS_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export FFTW_NUM_THREADS=1

# bind each of the 8 ranks to one L3 cache domain (6 cores)
mpirun -np 8 --bind-to l3cache --report-bindings vasp_std

Example jobscript for the MKL build

#!/bin/bash
#SBATCH -J vasp_mkl_hyb_l3_8x6
#SBATCH -N 1
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=6
#SBATCH -p cpu
#SBATCH -t 48:00:00
#SBATCH --exclusive

module purge
module load vasp-mkl

# enable the AVX512 code paths in MKL on AMD CPUs (see the libfakeintel section above)
export LD_PRELOAD=/lib64/libfakeintel.so:${LD_PRELOAD}

# 6 OpenMP threads per rank, pinned to the cores of the rank's L3 domain
export OMP_NUM_THREADS=6
export OMP_PLACES=cores
export OMP_PROC_BIND=close

# keep MKL single-threaded; OpenMP parallelism comes from VASP itself
export MKL_NUM_THREADS=1
export MKL_DYNAMIC=FALSE

# bind each of the 8 ranks to one L3 cache domain (6 cores)
mpirun -np 8 --bind-to l3cache --report-bindings vasp_std
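
Example jobscript for a multi-node MPI-only run

For multi-node runs, where MPI-only performed best in our tests, a job script can look like the following sketch (shown for the AOCL build; adjust module, node count, and walltime as needed):

#!/bin/bash
#SBATCH -J vasp_aocl_mpi_2nodes
#SBATCH -N 2
#SBATCH --ntasks-per-node=48
#SBATCH --cpus-per-task=1
#SBATCH -p cpu
#SBATCH -t 48:00:00
#SBATCH --exclusive

module purge
module load vasp-aocl

# MPI-only: one rank per core, no OpenMP threading
export OMP_NUM_THREADS=1
export BLIS_NUM_THREADS=1

# 96 ranks across 2 nodes, one rank pinned to each core
mpirun -np 96 --bind-to core --report-bindings vasp_std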