VASP

Build configuration (MKL)

On Elysium, VASP can be built with Spack using:

spack install vasp@6.4.3 +openmp +fftlib ^openmpi@5.0.5 ^fftw@3+openmp ^intel-oneapi-mkl threads=openmp +ilp64

This configuration uses:

  • Intel oneAPI MKL (ILP64) for BLAS, LAPACK and ScaLAPACK,
  • VASP’s internal FFTLIB to avoid MKL CDFT issues on AMD,
  • Open MPI 5.0.5 as the MPI implementation,
  • OpenMP enabled for hybrid parallelisation.

We use MKL as the baseline because it is the de facto HPC standard and performs well on AMD EPYC once AVX512 code paths are enabled.
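
After the build completes, the installation can be loaded and its linkage checked. A minimal sketch, assuming the Spack-built binary is used directly (on Elysium a module such as vasp-mkl wraps this step, see the job scripts below):

spack load vasp@6.4.3 ^intel-oneapi-mkl
# if MKL is linked dynamically, vasp_std should resolve BLAS/LAPACK/ScaLAPACK to libmkl_*
ldd $(which vasp_std) | grep -i mkl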


Activating AVX512

Intel’s MKL only enables AVX512 optimisations on Intel CPUs. On AMD, MKL defaults to AVX2/SSE code paths.

To unlock the faster AVX512 kernels on AMD EPYC we provide libfakeintel, which overrides MKL's CPU-vendor detection so that the Intel code paths are selected.

MKL version   Library to preload
≤ 2024.x      /lib64/libfakeintel.so
≥ 2025.x      /lib64/libfakeintel2025.so (also works with older MKL versions)

⚠ Intel gives no guarantee that all AVX512 instructions work on AMD CPUs. In practice, the community has shown that not every kernel uses full AVX512 width, but the overall speed-up is still substantial.

Activate AVX512 by preloading the library in your job:

export LD_PRELOAD=/lib64/libfakeintel2025.so:${LD_PRELOAD}
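
To verify that the preload takes effect, MKL's verbose mode can be used: its header line states which instruction-set extensions are enabled (the exact wording varies between MKL versions). A minimal check, assuming a prepared VASP input directory:

# the first MKL_VERBOSE line reports the enabled ISA, e.g. "Intel(R) AVX-512"
MKL_VERBOSE=1 mpirun -np 1 vasp_std 2>&1 | grep -m1 -i "avx"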

Test case 1 – Si256 (DFT / Hybrid HSE06)

This benchmark uses a 256-atom silicon supercell (Si256) with the HSE06 hybrid functional. Hybrid DFT combines FFT-heavy parts with dense BLAS/LAPACK operations and is therefore a good proxy for most large-scale electronic-structure workloads.

Baseline: MPI-only, 1 node

Configuration Time [s] Speed-up vs baseline
MKL (no AVX512) 2367 1.00×
MKL (+ AVX512) 2017 1.17×

→ Always enable AVX512. The baseline DFT case runs 17 % faster with libfakeintel preloaded.


Build configuration (AOCL)

AOCL (AMD Optimizing CPU Libraries) is AMD's analogue to MKL, providing:

  • AMD BLIS (BLAS implementation)
  • AMD libFLAME (LAPACK)
  • AMD ScaLAPACK and AMD FFTW, optimised for AMD EPYC
  • all built with the AOCC compiler

Build example:

spack install vasp@6.4.3 +openmp +fftlib %aocc ^amdfftw@5 ^amdblis@5 threads=openmp ^amdlibflame@5 ^amdscalapack@5 ^openmpi

AOCL detects AMD micro-architecture automatically and therefore does not require libfakeintel.
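
As for the MKL build, the linkage can be checked after installation; a quick sketch, assuming dynamic linking:

spack load vasp@6.4.3 %aocc
# vasp_std should now resolve BLAS/LAPACK/FFT to AMD BLIS, libFLAME and amdfftw
ldd $(which vasp_std) | grep -E -i "blis|flame|fftw"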

Baseline: MPI-only, 1 node

Configuration Time [s] Speed-up vs MKL (+ AVX512)
MKL (+ AVX512) 2017 1.00×
AOCL (AMD BLIS / libFLAME) 1919 1.05×

The AOCL build is about 5 % faster than the MKL build with AVX512 enabled.


Hybrid parallelisation and NUMA domains

Each compute node has two EPYC 9254 CPUs with 24 cores each (48 total). Each CPU is subdivided into 4 NUMA domains with separate L3 caches and memory controllers.

  • MPI-only: 48 ranks per node (1 per core).
  • Hybrid L3: 8 MPI ranks × 6 OpenMP threads each, bound to individual L3 domains.

This L3-hybrid layout increases memory locality, because each rank mainly uses its own local memory and avoids cross-socket traffic.
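
The NUMA and cache layout described above can be confirmed on a compute node with standard Linux tools, for example (the expected values on Elysium are 2 sockets, 8 NUMA nodes and one L3 domain per 6-core group):

lscpu | grep -E "Socket|Core|NUMA|L3"
numactl --hardware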

Single-node hybrid results (Si256)

Configuration Time [s] Speed-up vs MPI-only
MKL (L3 hybrid) 1936 1.04×
AOCL (L3 hybrid) 1830 1.05×

Hybrid L3 adds a modest 4-5 % speed-up.


Multi-node scaling (Si256)

Configuration Nodes Time [s] Speed-up vs same configuration on 1 node
MKL MPI-only 2 1305 1.55×
AOCL MPI-only 2 1142 1.68×
MKL L3 hybrid 2 1147 1.69×
AOCL L3 hybrid 2 968 1.89×

Interpretation

AOCL shows the strongest scaling across nodes, and the hybrid MKL variant scales better than its MPI-only counterpart. The L3-hybrid layout maintains its efficiency in the multi-node regime.


Recommendations for DFT / Hybrid-DFT workloads

  • AOCL generally outperforms MKL (+AVX512) on AMD EPYC.
  • Prefer L3-Hybrid (8×6) on single-node and even multi-node jobs for FFT-heavy hybrid-DFT cases.
  • For pure MPI runs, both MKL (+AVX512) and AOCL scale well; AOCL slightly better.
  • Always preload libfakeintel2025.so if MKL is used.

Jobscript examples

AOCL – Hybrid L3 (8×6)

#!/bin/bash
#SBATCH -J vasp_aocl_l3hyb
#SBATCH -N 1
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=6
#SBATCH -p cpu
#SBATCH -t 48:00:00
#SBATCH --exclusive

module purge
module load vasp-aocl

export OMP_NUM_THREADS=6
export OMP_PLACES=cores
export OMP_PROC_BIND=close
export BLIS_NUM_THREADS=6

mpirun -np 8 --bind-to l3cache --report-bindings vasp_std

MKL (+AVX512) – Hybrid L3 (8×6)

#!/bin/bash
#SBATCH -J vasp_mkl_avx512_l3hyb
#SBATCH -N 1
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=6
#SBATCH -p cpu
#SBATCH -t 48:00:00
#SBATCH --exclusive

module purge
module load vasp-mkl
export LD_PRELOAD=/lib64/libfakeintel2025.so:${LD_PRELOAD}

export OMP_NUM_THREADS=6
export OMP_PLACES=cores
export OMP_PROC_BIND=close
export MKL_NUM_THREADS=6
export MKL_DYNAMIC=FALSE

mpirun -np 8 --bind-to l3cache --report-bindings vasp_std
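
Either script is submitted as usual; the --report-bindings output in the Slurm log is a quick way to confirm that each of the 8 ranks was bound to its own L3 domain (the script file name below is just a placeholder):

sbatch vasp_aocl_l3hyb.sh
# once the job has started, check the reported bindings in the job output
grep -i "bound" slurm-<jobid>.out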

Test case 2 – XAS (Core-level excitation)

The XAS Mn-in-ZnO case models a core-level excitation (X-ray Absorption Spectroscopy). These workloads are not FFT-dominated; instead they involve many unoccupied bands and projector evaluations.

Single-node results (XAS)

Configuration Time [s] Relative
MKL MPI-only 897 1.00×
AOCL MPI-only 905 0.99×
MKL L3 hybrid 1202 0.75×
AOCL L3 hybrid 1137 0.79×

Multi-node scaling (XAS)

Configuration Nodes Time [s] Relative
MKL MPI-only 2 1333 0.67×
AOCL MPI-only 2 1309 0.69×
MKL L3 hybrid 2 1366 0.66×
AOCL L3 hybrid 2 1351 0.67×

Interpretation

For core-level / XAS calculations, hybrid OpenMP parallelisation is counter-productive, and scaling beyond one node degrades performance further due to load imbalance and communication overhead.


Recommendations for XAS and similar workloads

  • Use an MPI-only, single-node configuration (an example job script follows this list).
  • MKL and AOCL perform identically within the margin of error.
  • Hybrid modes reduce efficiency and should be avoided.
  • Set OMP_NUM_THREADS=1 to avoid unwanted OpenMP activity.
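
A sketch of such an MPI-only job script, modeled on the AOCL script above (adjust module name, partition and time limit to your setup):

#!/bin/bash
#SBATCH -J vasp_aocl_xas
#SBATCH -N 1
#SBATCH --ntasks=48
#SBATCH --cpus-per-task=1
#SBATCH -p cpu
#SBATCH -t 48:00:00
#SBATCH --exclusive

module purge
module load vasp-aocl

# suppress OpenMP threading entirely for the MPI-only layout
export OMP_NUM_THREADS=1
export BLIS_NUM_THREADS=1

mpirun -np 48 --bind-to core --report-bindings vasp_std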

General guidance

For optimal performance on Elysium with AMD EPYC processors, we recommend using the AOCL build as the default choice for all VASP workloads. AOCL consistently outperforms or matches MKL (+AVX512) across tested scenarios (e.g., 5 % faster for Si256 single-node, up to 1.89× speedup for multi-node scaling) and does not require additional configuration like libfakeintel. However, MKL remains a robust alternative, especially for users requiring compatibility with existing workflows.

Workload type | Characteristics | Recommended setup
Hybrid DFT (HSE06, PBE0, etc.) | FFT + dense BLAS, OpenMP beneficial | AOCL L3 Hybrid (8×6)
Standard DFT (PBE, LDA) | light BLAS, moderate FFT | AOCL L3 Hybrid or MPI-only
Core-level / XAS / EELS | many unoccupied bands, projectors | AOCL MPI-only (single-node)
MD / AIMD (>100 atoms) | large FFTs per step | AOCL L3 Hybrid
Static small systems (<20 atoms) | few bands, small matrices | AOCL MPI-only

Recommendations:

  • Default to AOCL: Use the AOCL build for all workloads unless specific constraints (e.g., compatibility with Intel-based tools) require MKL.
  • AVX512 for MKL: If using MKL, always preload libfakeintel2025.so to enable AVX512 optimisations.
  • Benchmark if unsure: Test both MPI-only and L3 Hybrid on one node to determine the optimal configuration for your specific system; a minimal sketch follows.
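
A minimal benchmarking sketch along those lines, with hypothetical input-directory and job-script names:

# run the same input in two copies of the directory, one per layout
for layout in mpionly l3hyb; do
    cp -r input_si256 run_${layout}
    sbatch --chdir=run_${layout} vasp_aocl_${layout}.sh
done
# afterwards, compare the timing summary at the end of each OUTCAR
grep "Elapsed time" run_*/OUTCAR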