The NVIDIA HPC-Benchmarks collection provides four benchmarks (HPL, HPL-MxP, HPCG, and STREAM) that are widely used in the HPC community, optimized for performance on NVIDIA accelerated HPC systems.
NVIDIA's HPL and HPL-MxP benchmarks, based on the Netlib HPL and HPL-MxP benchmarks, solve a (random) dense linear system in double-precision (64-bit) arithmetic and in mixed-precision arithmetic using Tensor Cores, respectively, on distributed-memory computers equipped with NVIDIA GPUs.
NVIDIA's HPCG benchmark accelerates the High Performance Conjugate Gradients (HPCG) Benchmark. HPCG is a software package that performs a fixed number of multigrid preconditioned (using a symmetric Gauss-Seidel smoother) conjugate gradient (PCG) iterations using double precision (64-bit) floating point values.
NVIDIA's STREAM benchmark is a simple synthetic benchmark program that measures sustainable memory bandwidth. NVIDIA HPC-Benchmarks container includes STREAM benchmarks optimized for NVIDIA Ampere GPU architecture (sm80), NVIDIA Hopper GPU architecture (sm90) and NVIDIA Grace CPU.
The NVIDIA HPC-Benchmarks collection provides a multiplatform (x86 and aarch64) container image, hpc-benchmarks:24.09, based on the NVIDIA Optimized Frameworks 24.08 container image.
In addition to the NVIDIA Optimized Frameworks container images, the hpc-benchmarks:24.09 container image includes the following embedded packages:
- NVIDIA HPC-Benchmarks for MPI libraries that are ABI-compatible with MPICH (e.g., MPICH, Cray MPICH, MVAPICH) and for OpenMPI, available on NVIDIA DEVELOPER.
Using the NVIDIA HPC-Benchmarks Container requires the host system to have the following installed:
For supported versions, see the Framework Containers Support Matrix and the NVIDIA Container Toolkit Documentation.
The NVIDIA HPC-Benchmarks Container supports both the NVIDIA Ampere GPU architecture (sm80) and the NVIDIA Hopper GPU architecture (sm90). This version of the container supports clusters featuring DGX A100, DGX H100, NVIDIA Grace Hopper, and NVIDIA Grace CPU nodes. Previous GPU generations are not expected to be compatible.
The hpc-benchmarks:24.09 container provides the NVIDIA HPL, NVIDIA HPL-MxP, NVIDIA HPCG, and NVIDIA STREAM benchmarks in the following folder structure:
For the x86_64 container image:
- hpl.sh script in the folder /workspace to invoke the xhpl executable.
- hpl-mxp.sh script in the folder /workspace to invoke the xhpl-mxp executable.
- hpcg.sh script in the folder /workspace to invoke the xhpcg executable.
- stream-gpu-test.sh script in the folder /workspace to invoke the stream_test executable for NVIDIA H100 or A100 GPUs.
- NVIDIA HPL in the folder /workspace/hpl-linux-x86_64 contains:
  - xhpl executable.
  - sample-slurm directory.
  - sample-dat directory.
- NVIDIA HPL-MxP in the folder /workspace/hpl-mxp-linux-x86_64 contains:
  - xhpl_mxp executable.
  - sample-slurm directory.
- NVIDIA HPCG in the folder /workspace/hpcg-linux-x86_64 contains:
  - xhpcg executable.
  - sample-slurm directory.
  - sample-dat directory.
- NVIDIA STREAM in the folder /workspace/stream-gpu-linux-x86_64 contains:
  - stream_test executable: GPU STREAM benchmark with double-precision elements.
  - stream_test_fp32 executable: GPU STREAM benchmark with single-precision elements.

For the aarch64 container image:
- hpl-aarch64.sh script in the folder /workspace to invoke the xhpl executable for NVIDIA Grace CPU.
- hpl.sh script in the folder /workspace to invoke the xhpl executable for NVIDIA Grace Hopper.
- hpl-mxp-aarch64.sh script in the folder /workspace to invoke the xhpl-mxp executable for NVIDIA Grace CPU.
- hpl-mxp.sh script in the folder /workspace to invoke the xhpl-mxp executable for NVIDIA Grace Hopper.
- hpcg-aarch64.sh script in the folder /workspace to invoke the xhpcg executables for NVIDIA Grace Hopper and Grace CPU.
- stream-test-cpu.sh script in the folder /workspace to invoke the stream_test executable for NVIDIA Grace CPU.
- stream-test-gpu.sh script in the folder /workspace to invoke the stream_test executable for NVIDIA Grace Hopper.
- NVIDIA HPL in the folder /workspace/hpl-linux-aarch64 contains:
  - xhpl executable for NVIDIA Grace CPU.
  - sample-slurm directory.
  - sample-dat directory.
- NVIDIA HPL in the folder /workspace/hpl-linux-aarch64-gpu contains:
  - xhpl executable for NVIDIA Grace Hopper.
  - sample-slurm directory.
  - sample-dat directory.
- NVIDIA HPL-MxP in the folder /workspace/hpl-mxp-linux-aarch64 contains:
  - xhpl_mxp executable for NVIDIA Grace CPU.
  - sample-slurm directory.
- NVIDIA HPL-MxP in the folder /workspace/hpl-mxp-linux-aarch64-gpu contains:
  - xhpl_mxp executable for NVIDIA Grace Hopper.
  - sample-slurm directory.
- NVIDIA HPCG in the folder /workspace/hpcg-linux-aarch64 contains:
  - xhpcg executable for NVIDIA Grace Hopper.
  - xhpcg-cpu executable for NVIDIA Grace CPU.
  - sample-slurm directory.
  - sample-dat directory.
- NVIDIA STREAM in the folder /workspace/stream-gpu-linux-aarch64 contains:
  - stream_test executable: GPU STREAM benchmark with double-precision elements.
  - stream_test_fp32 executable: GPU STREAM benchmark with single-precision elements.
- NVIDIA STREAM in the folder /workspace/stream-cpu-linux-aarch64 contains:
  - stream_test executable: NVIDIA Grace CPU STREAM benchmark with double-precision elements.

The NVIDIA HPL benchmark uses the same input format as the standard Netlib HPL benchmark. Please see the Netlib HPL benchmark to get started with the HPL software concepts and best practices.
The NVIDIA HPCG benchmark uses the same input format as the standard HPCG-Benchmark. Please see the HPCG-Benchmark to get started with the HPCG software concepts and best practices.
The NVIDIA HPL-MxP benchmark accepts a list of parameters that describe the input task and set additional tuning options. The description of these parameters can be found in the README and TUNING files.
The NVIDIA HPL, NVIDIA HPL-MxP, and NVIDIA HPCG benchmarks with GPU support require one GPU per MPI process. Therefore, ensure that the number of MPI processes matches the number of available GPUs in the cluster.
Version 24.09 of the NVIDIA HPL benchmark supports an 'out-of-core' mode. This is an opt-in feature, and the default mode remains the 'in-core' mode.
The NVIDIA HPL out-of-core mode enables the use of larger matrix sizes. Unlike the in-core mode, any matrix data that exceeds GPU memory capacity is automatically stored in host CPU memory. To activate this feature, set the environment variable HPL_OOC_MODE=1 and specify a larger matrix size (e.g., using the N parameter in the input file).
Performance will depend on host-device transfer speeds. For best performance, try to keep the amount of host memory used for the matrix to around 6-16 GiB on platforms where the CPU and GPU are connected via PCIe (such as x86). On systems with a faster CPU-GPU interconnect (such as Grace Hopper), sizes greater than 16 GiB may be beneficial. One way to estimate the matrix size for this feature is to take the largest per-GPU memory footprint used with NVIDIA HPL in-core mode, add the target amount of host data, and then derive the new matrix size from this total.
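As a rough, hypothetical illustration of this sizing method (the GPU count, per-GPU footprint, host-data target, and NB below are assumed numbers, not measurements or recommendations), the new N for an N-by-N double-precision matrix could be estimated as:

# Assumed: 8 GPUs, ~75 GiB of matrix per GPU in in-core mode, plus 16 GiB of host data per GPU.
TOTAL_GIB=$(( 8 * (75 + 16) ))                           # total matrix footprint in GiB
TOTAL_BYTES=$(( TOTAL_GIB * 1024 * 1024 * 1024 ))
# An N-by-N matrix of 8-byte doubles occupies 8*N*N bytes, so N = sqrt(bytes / 8).
N=$(awk -v b="$TOTAL_BYTES" 'BEGIN { printf "%d", sqrt(b / 8) }')
NB=1024                                                  # assumed blocking factor from HPL.dat
echo "Estimated N: $(( (N / NB) * NB ))"                 # rounded down to a multiple of NB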
All the environment variables needed by the NVIDIA HPL out-of-core mode can be found in the provided /workspace/hpl-linux-x86_64/TUNING or /workspace/hpl-linux-aarch64-gpu/TUNING files.
If NVIDIA HPL out-of-core mode is enabled, it is highly recommended to pass the CPU, GPU, and memory affinity arguments to hpl.sh.
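For example, a minimal sketch of enabling out-of-core mode on a hypothetical 4-GPU node (the affinity values below are placeholders; derive the real ones from your system topology, e.g. with nvidia-smi topo -m):

export HPL_OOC_MODE=1           # opt in to out-of-core mode
export HPL_OOC_MAX_GPU_MEM=-1   # no explicit cap on the GPU memory used by OOC (default)
./hpl.sh --dat /my-dat-files/HPL.dat \
         --gpu-affinity 0:1:2:3 \
         --cpu-affinity 0-15:16-31:32-47:48-63 \
         --mem-affinity 0:1:2:3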
The scripts hpl.sh and hpcg.sh can be invoked on a command line or through a Slurm batch script to launch the NVIDIA HPL and NVIDIA HPCG benchmarks, respectively. The scripts hpl.sh and hpcg.sh accept the following parameters:
- --dat path to HPL.dat.
Optional parameters:
- --gpu-affinity <string> colon separated list of GPU indices
- --cpu-affinity <string> colon separated list of CPU index ranges
- --mem-affinity <string> colon separated list of memory indices
- --ucx-affinity <string> colon separated list of UCX devices
- --ucx-tls <string> UCX transport to use
- --exec-name <string> HPL executable file
- --no-multinode enable flags for no-multinode (no-network) execution (HPL only)
In addition, instead of an input file, the script hpcg.sh accepts the following parameters:
- --nx specifies the local (to an MPI process) X dimension of the problem
- --ny specifies the local (to an MPI process) Y dimension of the problem
- --nz specifies the local (to an MPI process) Z dimension of the problem
- --rt specifies the number of seconds for which the timed portion of the benchmark should run
- --b activates benchmarking mode to bypass the CPU reference execution when set to one (--b 1)
- --l2cmp activates compression in GPU L2 cache when set to one (--l2cmp 1)
- --of activates writing the log to text files instead of the console (--of=1)
- --gss specifies the slice size for the GPU rank (default is 2048)
- --p2p specifies the p2p comm mode: 0 MPI_CPU, 1 MPI_CPU_All2allv, 2 MPI_CUDA_AWARE, 3 MPI_CUDA_AWARE_All2allv, 4 NCCL. Default is MPI_CPU
- --npx specifies the process grid X dimension of the problem
- --npy specifies the process grid Y dimension of the problem
- --npz specifies the process grid Z dimension of the problem
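For example, a minimal sketch of launching NVIDIA HPCG through script parameters instead of an input file (the local problem size and run time below are placeholders, taken from the sample runs later in this document):

./hpcg.sh --nx 256 --ny 256 --nz 256 --rt 2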
The script hpl-mxp.sh can be invoked on a command line or through a Slurm batch script to launch the NVIDIA HPL-MxP benchmark. The script hpl-mxp.sh requires the following parameters:
- --gpu-affinity <string> colon separated list of GPU indices
- --nprow <int> number of rows in the processor grid
- --npcol <int> number of columns in the processor grid
- --nporder <string> "row" or "column" major layout of the processor grid
- --n <int> size of the N-by-N matrix
- --nb <int> the blocking constant (panel size)
The full list of accepted parameters can be found in the README and TUNING files.
Note: it is recommended to pass the CPU and memory affinity settings to the NVIDIA HPL-MxP benchmark. Below is an example for DGX H100 and DGX A100:
DGX H100: --mem-affinity 0:0:0:0:1:1:1:1 --cpu-affinity 0-13:14-27:28-41:42-55:56-69:70-83:84-97:98-111
DGX A100: --mem-affinity 2:3:0:1:6:7:4:5 --cpu-affinity 32-47:48-63:0-15:16-31:96-111:112-127:64-79:80-95
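Combining these settings, a sketch of a full single-node invocation on a DGX H100-like system might look as follows (matrix size, block size, and process grid are taken from the sample runs later in this document and are illustrative only):

./hpl-mxp.sh --n 380000 --nb 2048 --nprow 2 --npcol 4 --nporder row \
             --gpu-affinity 0:1:2:3:4:5:6:7 \
             --mem-affinity 0:0:0:0:1:1:1:1 \
             --cpu-affinity 0-13:14-27:28-41:42-55:56-69:70-83:84-97:98-111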
The script stream-gpu-test.sh can be invoked on a command line or through a Slurm batch script to launch the NVIDIA STREAM benchmark. The script stream-gpu-test.sh accepts the following optional parameters:
- --d <int> device number
- --n <int> number of elements in the arrays
- --dt fp32 enable the fp32 stream test
- --t <string> tests to be executed; can be any combination of:
  - C - COPY test
  - S - SCALE test
  - A - ADD test
  - T - TRIAD test
  For example, the value --t CST means that the COPY, SCALE, and TRIAD tests will be executed. The default value is CSAT.
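For example, a minimal sketch of running the GPU STREAM benchmark on device 0 with single-precision elements and only the COPY and TRIAD tests (the array size is a placeholder):

./stream-gpu-test.sh --d 0 --n 100000000 --dt fp32 --t CT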
The NVIDIA HPL, NVIDIA HPCG, NVIDIA HPL-MxP, and NVIDIA STREAM GPU benchmarks on aarch64 can be run similarly to those on x86_64 (see details in the x86 container image section).
This section provides sample runs of the NVIDIA HPL, NVIDIA HPL-MxP, and NVIDIA HPCG benchmarks for NVIDIA Grace CPU.
The scripts hpl-aarch64.sh and hpcg-aarch64.sh can be invoked either from the command line or through a Slurm batch script to launch the NVIDIA HPL and NVIDIA HPCG benchmarks for NVIDIA Grace CPU, respectively.
The scripts hpl-aarch64.sh and hpcg-aarch64.sh accept the following parameters:
- --dat path to HPL.dat.
Optional parameters:
- --cpu-affinity <string> colon separated list of CPU index ranges
- --mem-affinity <string> colon separated list of memory indices
- --ucx-affinity <string> colon separated list of UCX devices
- --ucx-tls <string> UCX transport to use
- --exec-name <string> HPL executable file
- --no-multinode enable flags for no-multinode (no-network) execution
Note: It is recommended to bind MPI processes to NUMA nodes on NVIDIA Grace CPU, for example: ./hpl-aarch64.sh --dat /my-dat-files/HPL.dat --cpu-affinity 0-71:72-143 --mem-affinity 0:1
In addition, instead of an input file, the script hpcg-aarch64.sh accepts the following parameters:
- --nx specifies the local (to an MPI process) X dimension of the problem
- --ny specifies the local (to an MPI process) Y dimension of the problem
- --nz specifies the local (to an MPI process) Z dimension of the problem
- --rt specifies the number of seconds for which the timed portion of the benchmark should run
- --b activates benchmarking mode to bypass the CPU reference execution when set to one (--b=1)
- --l2cmp activates compression in GPU L2 cache when set to one (--l2cmp=1)
- --of activates writing the log to text files instead of the console (--of=1)
- --gss specifies the slice size for the GPU rank (default is 2048)
- --css specifies the slice size for the CPU rank (default is 8)
The following parameters control the NVIDIA HPCG benchmark on Grace Hopper systems:
- --exm specifies the execution mode: 0 is GPU-only, 1 is Grace-only, and 2 is GPU-Grace. Default is 0.
- --ddm specifies the dimension in which the GPU and Grace local problems differ: 0 is auto, 1 is X, 2 is Y, and 3 is Z. Default is 0. Note that the GPU and Grace local problems can differ in one dimension only.
- --lpm controls the meaning of the value provided for the --g2c parameter. Applicable when --exm is 2; depends on the differing local dimension specified by --ddm. Value explanation:
  - 0: --nx 128 --ny 128 --nz 128 --ddm 2 --g2c 8 means the differing Grace dimension (Y in this example) is 1/8 of the differing GPU dimension: the GPU local problem is 128x128x128 and the Grace local problem is 128x16x128.
  - 1: --nx 128 --ny 128 --nz 128 --ddm 3 --g2c 64 means the differing Grace dimension (Z in this example) is 64: the GPU local problem is 128x128x128 and the Grace local problem is 128x128x64.
  - 2: --g2c is the ratio. For example, with --ddm 1, --nx 1024, and --g2c 8, the GPU X dimension is 896 and the Grace X dimension is 128.
  - 3: --g2c is absolute. For example, with --ddm 1, --nx 1024, and --g2c 96, the GPU X dimension is 928 and the Grace X dimension is 96.
- --g2c specifies the value of the differing dimension of the GPU and Grace local problems; its meaning depends on the --ddm and --lpm values (an annotated example follows the parameter list below).
Optional parameters of the hpcg-aarch64.sh script:
- --p2p specifies the p2p comm mode: 0 MPI_CPU, 1 MPI_CPU_All2allv, 2 MPI_CUDA_AWARE, 3 MPI_CUDA_AWARE_All2allv, 4 NCCL. Default is MPI_CPU
- --npx specifies the process grid X dimension of the problem
- --npy specifies the process grid Y dimension of the problem
- --npz specifies the process grid Z dimension of the problem
- --gpu-affinity colon separated list of GPU indices
- --cpu-affinity colon separated list of CPU index ranges
- --mem-affinity colon separated list of memory indices
- --ucx-affinity colon separated list of UCX devices
- --ucx-tls UCX transport to use
- --exec-name HPCG executable file
- --cuda-compat manually enable CUDA forward compatibility
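As an annotated illustration of the heterogeneous parameters above (adapted from the Grace Hopper x4 sample run later in this document; the affinity values assume that node topology), the following sketch runs one GPU rank and one Grace rank per superchip, with the Grace rank taking a local Y dimension of 64:

./hpcg-aarch64.sh --nx 256 --ny 1024 --nz 288 --rt 2 \
                  --exm 2 --ddm 2 --lpm 1 --g2c 64 \
                  --npx 4 --npy 4 --npz 1 \
                  --cpu-affinity 0-7:8-71:72-79:80-143:144-151:152-215:216-223:224-287 \
                  --mem-affinity 0:0:1:1:2:2:3:3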
The script hpl-mxp-aarch64.sh can be invoked on a command line or through a Slurm batch script to launch the NVIDIA HPL-MxP benchmark for NVIDIA Grace CPU. The script hpl-mxp-aarch64.sh requires the following parameters:
- --nprow <int> number of rows in the processor grid
- --npcol <int> number of columns in the processor grid
- --nporder <string> "row" or "column" major layout of the processor grid
- --n <int> size of the N-by-N matrix
- --nb <int> the blocking constant (panel size)
The full list of accepted parameters can be found in the README and TUNING files.
The script stream-cpu-test.sh can be invoked on a command line or through a Slurm batch script to launch the NVIDIA STREAM benchmark. The script stream-cpu-test.sh accepts the following optional parameters:
- --n <int> number of elements in the arrays
- --t <int> number of threads
For a general guide on pulling and running containers, see the Running A Container chapter in the NVIDIA Containers For Deep Learning Frameworks User’s Guide. For more information about using NGC, refer to the NGC Container User Guide.
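For example, a minimal sketch of running the CPU STREAM benchmark on NVIDIA Grace with 72 threads (the element count and thread count are placeholders):

./stream-cpu-test.sh --n 400000000 --t 72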
NVIDIA HPL accepts several runtime environment variables to improve performance on different platforms.
HPL_P2P_AS_BCAST: Which communication library to use in the final solve step.
- Default Value: 1
- Possible Values: 0 (NCCL bcast), 1 (NCCL send/recv), 2 (CUDA-aware MPI), 3 (host MPI), 4 (NVSHMEM)
HPL_USE_NVSHMEM: Enables/disables NVSHMEM support in HPL.
- Default Value: 1
- Possible Values: 1 (enable), 0 (disable)
HPL_FCT_COMM_POLICY: Which communication library to use in the panel factorization.
- Default Value: 1
- Possible Values: 0 (NVSHMEM), 1 (host MPI)
HPL_NVSHMEM_SWAP: Performs row swaps using NVSHMEM instead of NCCL (default).
- Default Value: 0
- Possible Values: 1 (enable), 0 (disable)
HPL_CHUNK_SIZE_NBS: Number of matrix blocks (of size NB) to group for computations.
- Default Value: 16
- Possible Values: >0
HPL_DIST_TRSM_FLAG: Perform the solve step (TRSM) in parallel rather than only on the ranks that own that part of the matrix.
- Default Value: 1
- Possible Values: 1 (enable), 0 (disable)
HPL_CTA_PER_FCT: Sets the number of CTAs (thread blocks) for factorization.
- Default Value: 16
- Possible Values: >0
HPL_ALLOC_HUGEPAGES: Use 2MB hugepages for host-side allocations. Done through the madvise syscall; requires /sys/kernel/mm/transparent_hugepage/enabled to be set to madvise to have an effect.
- Default Value: 0
- Possible Values: 1 (enable), 0 (disable)
WARMUP_END_PROG: Runs the main loop once before the 'real' run. Stops the warmup loop at x%.
- Default Value: -1
- Possible Values: -1 to 100
TEST_LOOPS: Runs the main loop x many times.
- Default Value: 1
- Possible Values: >0
HPL_CUSOLVER_MP_TESTS: Runs several tests of individual components of HPL (GEMMs, comms, etc.).
- Default Value: 1
- Possible Values: 1 (enable), 0 (disable)
HPL_CUSOLVER_MP_TESTS_GEMM_ITERS: Number of repeated GEMM calls in tests.
- Default Value: 128
- Possible Values: >0
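These variables are exported in the environment before launching the benchmark; for instance, a hypothetical sketch that switches the final-solve broadcast to NCCL bcast and enables hugepages for host allocations (values are illustrative, not tuning recommendations):

export HPL_P2P_AS_BCAST=0      # 0 = NCCL bcast
export HPL_ALLOC_HUGEPAGES=1   # requires transparent_hugepage to be set to madvise
./hpl.sh --dat /my-dat-files/HPL.dat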
Environment variables to set up and control the NVIDIA HPL benchmark out-of-core mode:
HPL_OOC_MODE: Enables/disables out-of-core mode.
- Default Value: 0
- Possible Values: 1 (enable), 0 (disable)
HPL_OOC_MAX_GPU_MEM: Limits the amount of GPU memory used for OOC (measured in GiB).
- Default Value: -1
- Possible Values: >=-1
HPL_OOC_TILE_M: Row blocking factor.
- Default Value: 4096
- Possible Values: >0
HPL_OOC_TILE_N: Column blocking factor.
- Default Value: 4096
- Possible Values: >0
HPL_OOC_NUM_STREAMS: Number of streams used for OOC operations.
- Default Value: 3
- Possible Values: >0
HPL_OOC_SAFE_SIZE: GPU memory (in GiB) needed for the driver; this amount of memory will not be used by HPL OOC.
- Default Value: 2.0
- Possible Values: >0
The examples below use Pyxis/enroot from NVIDIA to facilitate running HPC-Benchmarks containers. Note that an enroot .credentials file is necessary to use these NGC containers.
To copy and customize the sample Slurm scripts and/or sample HPL.dat/hpcg.dat files from the containers, run the container in interactive mode, while mounting a folder outside the container, and copy the needed files, as follows:
CONT='nvcr.io#nvidia/hpc-benchmarks:24.09'
MOUNT="$PWD:/home_pwd"
srun -N 1 --cpu-bind=none --mpi=pmix \
--container-image="${CONT}" \
--container-mounts="${MOUNT}" \
--pty bash
Once inside the container, copy the needed files to /home_pwd.
NVIDIA HPL run
Several sample Slurm scripts and several sample input files are available in the container at /workspace/hpl-linux-x86_64 or /workspace/hpl-linux-aarch64-gpu.
To run NVIDIA HPL on a single node with 4 GPUs using your custom HPL.dat file:
CONT='nvcr.io#nvidia/hpc-benchmarks:24.09'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"
srun -N 1 --ntasks-per-node=4 --cpu-bind=none --mpi=pmix \
--container-image="${CONT}" \
--container-mounts="${MOUNT}" \
./hpl.sh --dat /my-dat-files/HPL.dat
To run NVIDIA HPL on 16 nodes with 4 GPUs each (or 8 nodes with 8 GPUs each) using a provided sample HPL .dat file for 64 GPUs:
CONT='nvcr.io#nvidia/hpc-benchmarks:24.09'
srun -N 16 --ntasks-per-node=4 --cpu-bind=none --mpi=pmix \
--container-image="${CONT}" \
./hpl.sh --dat /workspace/hpl-linux-x86_64/sample-dat/HPL-64GPUs.dat
CONT='nvcr.io#nvidia/hpc-benchmarks:24.09'
srun -N 8 --ntasks-per-node=8 --cpu-bind=none --mpi=pmix \
--container-image="${CONT}" \
./hpl.sh --dat /workspace/hpl-linux-x86_64/sample-dat/HPL-64GPUs.dat
NVIDIA HPL-MxP run
Several sample Slurm scripts are available in the container at /workspace/hpl-mxp-linux-x86_64 or /workspace/hpl-mxp-linux-aarch64-gpu.
To run NVIDIA HPL-MxP on a single node with 8 GPUs:
CONT='nvcr.io/nvidia/hpc-benchmarks:24.09'
srun -N 1 --ntasks-per-node=8 \
--container-image="${CONT}" \
./hpl-mxp.sh --n 380000 --nb 2048 --nprow 2 --npcol 4 --nporder row --gpu-affinity 0:1:2:3:4:5:6:7
To run NVIDIA HPL-MxP on 4 nodes, each node with 4 GPUs:
CONT='nvcr.io/nvidia/hpc-benchmarks:24.09'
srun -N 4 --ntasks-per-node=4 \
--container-image="${CONT}" \
./hpl-mxp.sh --n 280000 --nb 2048 --nprow 4 --npcol 4 --nporder row --gpu-affinity 0:1:2:3
Pay special attention to CPU core affinity/binding, as it greatly affects the performance of the HPL benchmarks.
NVIDIA HPCG run
Several sample Slurm scripts and sample input files are available in the container at /workspace/hpcg-linux-x86_64 or /workspace/hpcg-linux-aarch64.
To run NVIDIA HPCG on a single node with 8 GPUs using your custom hpcg.dat file on x86:
CONT='nvcr.io/nvidia/hpc-benchmarks:24.09'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"
srun -N 1 --ntasks-per-node=8 --cpu-bind=none --mpi=pmix \
--container-image="${CONT}" \
--container-mounts="${MOUNT}" \
./hpcg.sh --dat /my-dat-files/hpcg.dat
To run NVIDIA HPCG on 16 nodes with 4 GPUs each (or 8 nodes with 8 GPUs each) using script parameters on x86:
CONT='nvcr.io/nvidia/hpc-benchmarks:24.09'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"
srun -N 16 --ntasks-per-node=4 --cpu-bind=none --mpi=pmix \
--container-image="${CONT}" \
--container-mounts="${MOUNT}" \
./hpcg.sh --nx 256 --ny 256 --nz 256 --rt 2
To run NVIDIA HPCG on 16 nodes with 4 GPUs each (or 8 nodes with 8 GPUs each) using script parameters on aarch64:
CONT='nvcr.io/nvidia/hpc-benchmarks:24.09'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"
srun -N 16 --ntasks-per-node=4 --cpu-bind=none --mpi=pmix \
--container-image="${CONT}" \
--container-mounts="${MOUNT}" \
./hpcg-aarch64.sh --nx 256 --ny 256 --nz 256 --rt 2
NVIDIA HPL run
Several sample input files are available in the container at /workspace/hpl-linux-aarch64.
To run NVIDIA HPL on two nodes of NVIDIA Grace CPU using your custom HPL.dat file:
CONT='nvcr.io#nvidia/hpc-benchmarks:24.09'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"
srun -N 2 --ntasks-per-node=2 --cpu-bind=none --mpi=pmix \
--container-image="${CONT}" \
--container-mounts="${MOUNT}" \
./hpl-aarch64.sh --dat /my-dat-files/HPL.dat --cpu-affinity 0-71:72-143 --mem-affinity 0:1
where --cpu-affinity maps to cores on the local node and --mem-affinity maps to NUMA nodes on the local node.
NVIDIA HPL-MxP run
To run NVIDIA HPL-MxP on a single node of NVIDIA Grace Hopper x4:
CONT='nvcr.io/nvidia/hpc-benchmarks:24.09'
srun -N 1 --ntasks-per-node=16 \
--container-image="${CONT}" \
./hpl-mxp-aarch64.sh --n 380000 --nb 2048 --nprow 2 --npcol 4 --nporder row \
--cpu-affinity 0-71:72-143:144-215:216-287 \
--mem-affinity 0:1:2:3
where --cpu-affinity maps to cores on the local node and --mem-affinity maps to NUMA nodes on the local node.
NVIDIA HPCG run
A sample input file is available in the container at /workspace/hpcg-linux-aarch64.
To run NVIDIA HPCG on two nodes of NVIDIA Grace CPU using your custom parameters:
CONT='nvcr.io/nvidia/hpc-benchmarks:24.09'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"
srun -N 2 --ntasks-per-node=4 --cpu-bind=none --mpi=pmix \
--container-image="${CONT}" \
--container-mounts="${MOUNT}" \
./hpcg-aarch64.sh --exm 1 --nx 512 --ny 512 --nz 288 --rt 30 --cpu-affinity 0-35:36-71:72-107:108-143 --mem-affinity 0:0:1:1
To run NVIDIA HPCG on NVIDIA Grace Hopper x4 using script parameters on aarch64:
CONT='nvcr.io/nvidia/hpc-benchmarks:24.09'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"
#GPU+Grace (Heterogeneous execution)
#GPU rank has 8 OpenMP threads and Grace rank has 64 OpenMP threads
srun -N 2 --ntasks-per-node=8 --cpu-bind=none --mpi=pmix \
--container-image="${CONT}" \
--container-mounts="${MOUNT}" \
./hpcg-aarch64.sh --nx 256 --ny 1024 --nz 288 --rt 2 \
--exm 2 --ddm 2 --lpm 1 --g2c 64 \
--npx 4 --npy 4 --npz 1 \
--cpu-affinity 0-7:8-71:72-79:80-143:144-151:152-215:216-223:224-287 \
--mem-affinity 0:0:1:1:2:2:3:3
The instructions below assume Singularity 3.4.1 or later.
Save the HPC-Benchmark container as a local Singularity image file:
$ singularity pull --docker-login hpc-benchmarks:24.09.sif docker://nvcr.io/nvidia/hpc-benchmarks:24.09
This command saves the container in the current directory as hpc-benchmarks:24.09.sif.
NVIDIA HPL run
Several sample Slurm scripts and several sample input files are available in the container at /workspace/hpl-linux-x86_64 or /workspace/hpl-linux-aarch64-gpu.
To run NVIDIA HPL on a single node with 4 GPUs using your custom HPL.dat file:
CONT='/path/to/hpc-benchmarks:24.09.sif'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"
srun -N 1 --ntasks-per-node=4 singularity run --nv \
-B "${MOUNT}" "${CONT}" \
./hpl.sh --dat /my-dat-files/HPL.dat
To run NVIDIA HPL on 16 nodes with 4 GPUs each (or 8 nodes with 8 GPUs each) using a provided sample HPL .dat file for 64 GPUs:
CONT='/path/to/hpc-benchmarks:24.09.sif'
srun -N 16 --ntasks-per-node=4 singularity run --nv \
"${CONT}" \
./hpl.sh --dat /workspace/hpl-linux-x86_64/sample-dat/HPL-64GPUs.dat
CONT='/path/to/hpc-benchmarks:24.09.sif'
srun -N 8 --ntasks-per-node=8 singularity run --nv \
"${CONT}" \
./hpl.sh --dat /workspace/hpl-linux-x86_64/sample-dat/HPL-64GPUs.dat
NVIDIA HPL-MxP run
Several sample Slurm scripts are available in the container at /workspace/hpl-mxp-linux-x86_64 or /workspace/hpl-mxp-linux-aarch64-gpu.
To run NVIDIA HPL-MxP on a single node with 8 GPUs:
CONT='/path/to/hpc-benchmarks:24.09.sif'
srun -N 1 --ntasks-per-node=8 singularity run --nv \
"${CONT}" \
./hpl-mxp.sh --n 380000 --nb 2048 --nprow 2 --npcol 4 --nporder row --gpu-affinity 0:1:2:3:4:5:6:7
To run NVIDIA HPL-MxP on 4 nodes, each with 4 GPUs:
CONT='/path/to/hpc-benchmarks:24.09.sif'
srun -N 4 --ntasks-per-node=4 singularity run --nv \
"${CONT}" \
./hpl-mxp.sh --n 280000 --nb 2048 --nprow 4 --npcol 4 --nporder row --gpu-affinity 0:1:2:3
Pay special attention to CPU core affinity/binding, as it greatly affects the performance of the HPL benchmarks.
NVIDIA HPCG run
Several sample Slurm scripts and sample input files are available in the container at /workspace/hpcg-linux-x86_64 or /workspace/hpcg-linux-aarch64.
To run NVIDIA HPCG on a single node with 8 GPUs using your custom hpcg.dat file on x86:
CONT='/path/to/hpc-benchmarks:24.09.sif'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"
srun -N 1 --ntasks-per-node=8 singularity run --nv \
-B "${MOUNT}" "${CONT}" \
./hpcg.sh --dat /my-dat-files/hpcg.dat
To run NVIDIA HPCG on 16 nodes with 4 GPUs each (or 8 nodes with 8 GPUs each) using script parameters on x86:
CONT='/path/to/hpc-benchmarks:24.09.sif'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"
srun -N 16 --ntasks-per-node=4 singularity run --nv \
-B "${MOUNT}" "${CONT}" \
./hpcg.sh --nx 256 --ny 256 --nz 256 --rt 2
To run NVIDIA HPCG on a single node with 4 GPUs using your custom hpcg.dat file on aarch64:
CONT='/path/to/hpc-benchmarks:24.09.sif'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"
srun -N 1 --ntasks-per-node=8 singularity run --nv \
-B "${MOUNT}" "${CONT}" \
./hpcg-aarch64.sh --dat /my-dat-files/hpcg.dat
NVIDIA HPL run
Several sample input files are available in the container at /workspace/hpl-linux-aarch64.
To run NVIDIA HPL on two nodes of NVIDIA Grace CPU using your custom HPL.dat file:
CONT='/path/to/hpc-benchmarks:24.09.sif'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"
srun -N 2 --ntasks-per-node=2 singularity run \
-B "${MOUNT}" "${CONT}" \
./hpl-aarch64.sh --dat /my-dat-files/HPL.dat --cpu-affinity 0-71:72-143 --mem-affinity 0:1
where --cpu-affinity maps to cores on the local node and --mem-affinity maps to NUMA nodes on the local node.
NVIDIA HPL-MxP run
To run NVIDIA HPL-MxP on a single node of NVIDIA Grace Hopper x4:
CONT='/path/to/hpc-benchmarks:24.09.sif'
srun -N 1 --ntasks-per-node=16 singularity run \
"${CONT}" \
./hpl-mxp-aarch64.sh --n 380000 --nb 2048 --nprow 2 --npcol 4 --nporder row \
--cpu-affinity 0-71:72-143:144-215:216-287 \
--mem-affinity 0:1:2:3
where --cpu-affinity maps to cores on the local node and --mem-affinity maps to NUMA nodes on the local node.
NVIDIA HPCG run
Sample input files are available in the container at /workspace/hpcg-linux-aarch64.
To run NVIDIA HPCG on two nodes of NVIDIA Grace CPU using your custom parameters:
CONT='/path/to/hpc-benchmarks:24.09.sif'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"
srun -N 2 --ntasks-per-node=4 singularity run \
-B "${MOUNT}" "${CONT}" \
./hpcg-aarch64.sh --exm 1 --nx 512 --ny 512 --nz 288 --rt 10 --cpu-affinity 0-35:36-71:72-107:108-143 --mem-affinity 0:0:1:1
To run NVIDIA HPCG on NVIDIA Grace Hopper x4 using script parameters on aarch64:
CONT='/path/to/hpc-benchmarks:24.09.sif'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"
#GPU+Grace (Heterogeneous execution)
srun -N 2 --ntasks-per-node=8 singularity run \
-B "${MOUNT}" "${CONT}" \
./hpcg-aarch64.sh --nx 256 --ny 1024 --nz 288 --rt 2 \
--exm 2 --ddm 2 --lpm 1 --g2c 64 \
--npx 4 --npy 4 --npz 1 \
--cpu-affinity 0-7:8-71:72-79:80-143:144-151:152-215:216-223:224-287 \
--mem-affinity 0:0:1:1:2:2:3:3
The examples below are for single-node runs with Docker. Docker is not recommended for multi-node runs.
Download the HPC-Benchmarks container as a local Docker image:
$ docker pull nvcr.io/nvidia/hpc-benchmarks:24.09
NOTE: Adding the --privileged flag to the Docker command prevents the “set_mempolicy” error.
NVIDIA HPL run
To run NVIDIA HPL on a single node with 4 GPUs using your custom HPL.dat file:
CONT='nvcr.io/nvidia/hpc-benchmarks:24.09'
MOUNT="/full-path/to/your/custom/dat-files:/my-dat-files"
docker run --gpus all --shm-size=1g -v ${MOUNT} \
${CONT} \
mpirun --bind-to none -np 4 \
./hpl.sh --dat /my-dat-files/HPL.dat
NVIDIA HPL-MxP run
To run NVIDIA HPL-MxP on a single node with 8 GPUs:
CONT='nvcr.io/nvidia/hpc-benchmarks:24.09'
docker run --gpus all --shm-size=1g \
${CONT} \
mpirun --bind-to none -np 8 \
./hpl-mxp.sh --n 380000 --nb 2048 --nprow 2 --npcol 4 --nporder row --gpu-affinity 0:1:2:3:4:5:6:7
NVIDIA HPCG run
To run NVIDIA HPCG on a single node with 8 GPUs using your custom hpcg.dat file on x86:
CONT='nvcr.io/nvidia/hpc-benchmarks:24.09'
MOUNT="/full-path/to/your/custom/dat-files:/my-dat-files"
docker run --gpus all --shm-size=1g -v ${MOUNT} \
${CONT} \
mpirun --bind-to none -np 8 \
./hpcg.sh --dat /my-dat-files/hpcg.dat
NVIDIA HPL run
Several sample Docker run scripts are available in the container at /workspace/hpl-linux-aarch64.
To run NVIDIA HPL on a single NVIDIA Grace CPU node using your custom HPL.dat file:
CONT='nvcr.io/nvidia/hpc-benchmarks:24.09'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"
docker run -v ${MOUNT} \
"${CONT}" \
mpirun --bind-to none -np 2 \
./hpl-aarch64.sh --dat /my-dat-files/HPL.dat --cpu-affinity 0-71:72-143 --mem-affinity 0:1
where --cpu-affinity maps to cores on the local node and --mem-affinity maps to NUMA nodes on the local node.
NVIDIA HPL-MxP run
Several sample Docker run scripts are available in the container at /workspace/hpl-mxp-linux-aarch64.
To run NVIDIA HPL-MxP on a single node of NVIDIA Grace Hopper x4:
CONT='nvcr.io/nvidia/hpc-benchmarks:24.09'
docker run \
"${CONT}" \
mpirun --bind-to none -np 4 \
./hpl-mxp-aarch64.sh --n 380000 --nb 2048 --nprow 2 --npcol 4 --nporder row \
--cpu-affinity 0-71:72-143:144-215:216-287 --mem-affinity 0:1:2:3
where --cpu-affinity maps to cores on the local node and --mem-affinity maps to NUMA nodes on the local node.
NVIDIA HPCG run
Several sample Docker run scripts are available in the container at /workspace/hpcg-linux-aarch64.
To run NVIDIA HPCG on a single node of NVIDIA Grace CPU using your custom parameters:
CONT='nvcr.io/nvidia/hpc-benchmarks:24.09'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"
docker run -v ${MOUNT} \
"${CONT}" \
mpirun --bind-to none -np 4 \
./hpcg-aarch64.sh --exm 1 --nx 512 --ny 512 --nz 288 --rt 10 --cpu-affinity 0-35:36-71:72-107:108-143 --mem-affinity 0:0:1:1
To run NVIDIA HPCG on NVIDIA Grace Hopper x4 using script parameters on aarch64:
CONT='nvcr.io/nvidia/hpc-benchmarks:24.09'
MOUNT="/path/to/your/custom/dat-files:/my-dat-files"
#GPU+Grace (Heterogeneous execution)
docker run -v ${MOUNT} \
"${CONT}" \
mpirun --bind-to none -np 16 \
./hpcg-aarch64.sh --nx 256 --ny 1024 --nz 288 --rt 2 \
--exm 2 --ddm 2 --lpm 1 --g2c 64 \
--npx 4 --npy 4 --npz 1 \
--cpu-affinity 0-7:8-71:72-79:80-143:144-151:152-215:216-223:224-287 \
--mem-affinity 0:0:1:1:2:2:3:3
HPL out-of-core (OOC): If you experience GPU out-of-memory issues with HPL OOC, consider increasing the amount of GPU memory reserved for the driver (not used by HPL OOC). This can be achieved by adjusting the HPL_OOC_SAFE_SIZE environment variable. The default value is 2.0 (the buffer size in GiB). Depending on the GPU/driver, you may need to increase this further to resolve memory issues.
HPL-MxP: The following error may appear for rare combinations of N (matrix size) and PxQ decomposition: /workspace/hpl-mxp-linux-x86_64/xhpl_mxp: symbol lookup error: /opt/hpcx/ucx/lib/ucx/libuct_cuda_gdrcopy.so.0: undefined symbol: gdr_get_info_v2. The workaround is to disable GDRCopy in UCX with the following environment setting: export UCX_TLS=^gdr_copy
HPL-MxP: The input task must satisfy the following condition:
((N / NB) / npcol) / u-panel-chunk-nbs < 20
where:
- N - size of the N-by-N matrix
- NB - the blocking constant (panel size)
- npcol - number of columns in the processor grid
- u-panel-chunk-nbs - U panel chunk size given in units of NB (default 8)
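For example, with the single-node sample run shown earlier in this document (--n 380000 --nb 2048 --npcol 4) and the default u-panel-chunk-nbs of 8, the condition evaluates to ((380000 / 2048) / 4) / 8 ≈ 5.8, which satisfies the constraint.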