Scalable GPU programming for exascale: multi-node and multi-GPU strategies

Nov 16, 2025

| System | JUWELS (JSC) | JUPITER (JSC, EuroHPC) | Colossus (xAI) |
|---|---|---|---|
| Year | 2018 (Cluster) / 2020 (Booster) | 2024–2025 (Booster deployment → exascale) | 2024–2025 (Phase 1 online) |
| Purpose | HPC (petascale) | Exascale HPC + AI | Large-scale AI training |
| Compute | Cluster (Xeon) + Booster (EPYC + A100) | ~6,000 GH200 nodes (4× GH200 each) | ~100,000–200,000 NVIDIA H100/H200 GPUs |
| GPU count | 3,744 A100 GPUs (Booster) | ~24,000 Grace-Hopper GPUs | 100k+ Hopper-class GPUs |
| CPU | Xeon / EPYC | 4× Grace CPUs per node (72 cores each) | Minimal CPU role (GPU-centric) |
| Peak perf. | ~70 PFLOP/s | ~1 EFLOP/s target (FP64) | AI-precision PFLOP/s (not FP64-oriented) |
| Network | HDR InfiniBand (DragonFly+) | NDR200 InfiniBand (Quantum-2) | NVIDIA Spectrum-X Ethernet |
| Power | Tens of MW | ~18 MW (Booster) | 150–300 MW (estimated) |
| Use case | Scientific simulation | Extreme HPC + AI hybrid | Massive AI model training |
| Cost | ~€145M (modules combined) | ~€500M (EuroHPC contract value) | ~$10B–$15B (industry estimates for 100k–200k H100/H200 GPUs) |

Four exascale systems—Frontier, Aurora, El Capitan, and JUPITER—are now deployed, marking GPUs as the central computing engines of this era of HPC. Although JUPITER is down during SC25, special thanks go to Andreas Herten for providing courtesy access to JUWELS at Jülich, which made for some enjoyable weekend experimentation. You can find the slides and all source code in the team's repo below; credit goes to Simon Garcia de Gonzalo, Andreas Herten, Lena Oden, David Appelhans, Markus Hrywniak, and Jiri Kraus.

********************************************************************************
*  Welcome to                                                                  *
*        _ _   ___        _______ _     ____                                   *
*      | | | | \ \  /\  / /  ___| |   /____|                                   *
*   _  | | | | |\ \/  \/ /|  _| | |   \___ \                                   *
*  | |_| | |_| | \  /\  / | |___| |___ ___) |     Juelich Wizard               *
*   \___/ \___/ __\/  \/__|_____|_____|____/         for                       *
*  |  _ \ / _ \ / _ \/ ___|_   _|  ___|  _ \        European Leadership        *
*  | |_) | | | | | | |___ \ | | | |_  | |_) |         Science                  *
*  | |_) | |_| | |_| |___) || | | |___| |\ \            Booster                *
*  |____/ \___/ \___/|____/ |_| |_____|_| \_\                                  *
*                                                                              *
********************************************************************************
* Information about the system, latest changes, user documentation and FAQs:   *
*  -> https://go.fzj.de/JUWELS   -> https://go.fzj.de/juwels-known-issues      *
* JUWELS booster slurm job reports:                                            *
*  -> https://go.fzj.de/llview-juwelsbooster                                   *
********************************************************************************

                                                         2023-04-20T14:00+02:00
 ## Enabling large jobs in JUWELS Booster ##

 The largebooster partition is down during regular production times. Users
 wanting to use it (for more than 384 nodes) are allowed to submit to it, but
 are advised to contact sc@fz-juelich.de for appropriate scheduling of those
 jobs, possibly together with other large jobs. If these requests increase in
 frequency we plan to organize regular "big days" where the system will be used
 mostly for large scale runs.

********************************************************************************
* Status information also at https://go.fzj.de/status-juwels-booster           *
********************************************************************************
                                                          2025-03-31T15:54+0200
Information - Do not constrain GPU devices with Slurm
Due to a bug in Slurm the GPU devices will not be constrained by cgroups until
it is properly fixed.
--------------------------------------------------------------------------------
                                              Starting on 2025-12-03T07:00+0100
Planned - Datacenter Maintenance
A planned maintenance for the network backbone at the Jülich Supercomputing
Centre (JSC) datacenter is scheduled from 07:00 CET on 2025-12-03 to 19:00 CET
on 2025-12-04.

During this period, all HPC, data, and cloud systems at JSC will be unavailable.
********************************************************************************
[wood2@jwlogin04 ~]$ 

CUDA-aware MPI

RDMA

+-----------+                +-----------+
|   GPU 0   |                |   GPU 1   |
| (sender)  |                | (receiver)|
+-----------+                +-----------+
      |                            |
      | DMA over PCIe              | DMA over PCIe
      v                            ^
+-----------+                +-----------+
| Host RAM  | <-- CPU -->    | Host RAM  |
+-----------+                +-----------+
           \                /
            \              /
              \          /
                \      /
               Network (MPI)

Standard MPI (not CUDA-aware) only understands host pointers. You normally must do:

cudaMemcpy(host_buf, dev_buf, n, cudaMemcpyDeviceToHost);
MPI_Send(host_buf, n, MPI_BYTE, ...);

and the receiver does:

MPI_Recv(host_buf, n, MPI_BYTE, ...);
cudaMemcpy(dev_buf, host_buf, n, cudaMemcpyHostToDevice);

CUDA-aware MPI allows MPI communication routines to accept device pointers (i.e., pointers returned by cudaMalloc) directly as send and receive buffers.

+------------------+          PCIe           +------------------+
|      GPU 0       | <---------------------> |      NIC 0       |
|  (Sender Device) |                         | (Sender Adapter) |
+------------------+                         +------------------+
          |                                          |
          |            InfiniBand Network            |
          |      (RDMA READ / RDMA WRITE ops)        |
          v                                          v
+------------------+                         +------------------+
|      NIC 1       | <---------------------> |      GPU 1       |
| (Receiver Adapter)|          PCIe          | (Receiver Device)|
+------------------+                         +------------------+
// MPI rank 0
MPI_Send(s_buf_d, n, MPI_BYTE, size-1, tag, MPI_COMM_WORLD);

// MPI rank size-1
MPI_Recv(r_buf_d, n, MPI_BYTE, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

Here, s_buf_d and r_buf_d are GPU pointers, enabling direct communication between device memory regions without explicit host staging.

To build an application that combines CUDA kernels with MPI host code, you typically compile in two steps:

nvcc $(NVCC_FLAGS) -c my_kernels.cu -o my_kernels.o
mpicc my_multiGPUapp.c my_kernels.o -L$(CUDA_HOME)/lib64 -lcudart -o my_multiGPUapp

The output executable my_multiGPUapp supports both MPI and CUDA.

# Simple Makefile for multi-GPU Jacobi solver example
THIS_TASK := 03H-task
OUTPUT_NAME := jacobi.$(THIS_TASK)__$(shell date '+%Y%m%d-%H%M')
NP ?= 4
NVCC = nvcc
JSC_SUBMIT_CMD ?= srun --gres=gpu:4 --ntasks-per-node 4
CUDA_HOME ?= /usr/local/cuda
_GPU_ARCH ?= 80
GENCODE_FLAGS := -gencode arch=compute_$(_GPU_ARCH),code=sm_$(_GPU_ARCH) -gencode arch=compute_$(_GPU_ARCH),code=compute_$(_GPU_ARCH)
NVCC_FLAGS := -dc -Xcompiler -fopenmp -lineinfo -std=c++14 $(GENCODE_FLAGS) -I$(MPI_HOME)/include
NVCC_LDFLAGS := -ccbin=mpic++ -L$(MPI_HOME)/lib -lmpi -L$(CUDA_HOME)/lib64 -lcuda -lcudart

jacobi: jacobi.cu
	$(NVCC) $(NVCC_FLAGS) jacobi.cu -c -o jacobi.o
	$(NVCC) $(GENCODE_FLAGS) jacobi.o -o jacobi $(NVCC_LDFLAGS)

run: jacobi
	$(JSC_SUBMIT_CMD) -n $(NP) ./jacobi

profile: jacobi
	$(JSC_SUBMIT_CMD) -n $(NP) nsys profile --trace=mpi,cuda,nvtx -o $(OUTPUT_NAME).%q{SLURM_PROCID} ./jacobi -niter 10

CUDA Unified Virtual Addressing (UVA) provides a single, unified virtual address space that includes both CPU and GPU memory. With UVA, the numerical value of a pointer uniquely identifies where its data resides—whether in system RAM, local GPU memory, or peer GPU memory—allowing CUDA and external libraries to automatically determine the correct memory location from the pointer alone. This greatly simplifies programming models and enables features like CUDA-aware MPI, peer-to-peer memory transfers, and unified memory management. UVA is supported on all modern 64-bit systems with GPUs of compute capability 2.0 or higher.
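
Under UVA you can ask the runtime where a given pointer lives. The short check below is a minimal sketch (not part of the course code) using cudaPointerGetAttributes, assuming CUDA 10 or newer where the attributes struct exposes a type field; CUDA-aware MPI implementations rely on the same mechanism to decide whether a buffer needs host staging.

#include <cstdio>
#include <cuda_runtime.h>

// Report whether a pointer refers to device, pinned host, or managed memory under UVA.
static void where_is(const void* p, const char* name) {
    cudaPointerAttributes attr;
    cudaError_t err = cudaPointerGetAttributes(&attr, p);
    if (err != cudaSuccess) { printf("%s: %s\n", name, cudaGetErrorString(err)); return; }
    switch (attr.type) {
        case cudaMemoryTypeDevice:  printf("%s: device memory on GPU %d\n", name, attr.device); break;
        case cudaMemoryTypeHost:    printf("%s: page-locked host memory\n", name); break;
        case cudaMemoryTypeManaged: printf("%s: managed (unified) memory\n", name); break;
        default:                    printf("%s: ordinary (unregistered) host memory\n", name); break;
    }
}

int main() {
    float *d_buf = nullptr, *h_pinned = nullptr;
    cudaMalloc(&d_buf, 1024 * sizeof(float));
    cudaMallocHost(&h_pinned, 1024 * sizeof(float));
    where_is(d_buf, "d_buf");        // expected: device memory
    where_is(h_pinned, "h_pinned");  // expected: page-locked host memory
    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    return 0;
}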

Blocking vs async MPI communications

1. Blocking MPI Communication (Host Buffers, No CUDA-Aware MPI)

Rank 0 (sender)                           Rank 1 (receiver)
-------------------------                 -------------------------
 GPU0           Host0                     Host1           GPU1
  |              |                         |              |
  | cudaMemcpy   |                         |              |
  v              v                         v              v
[d_send] -> [h_send] -- MPI_Send --> [h_recv] -> [d_recv]

Timeline (blocking):
1. GPU0 -> Host0: cudaMemcpy (D2H)
2. Host0 -> Host1: MPI_Send/MPI_Recv
3. Host1 -> GPU1: cudaMemcpy (H2D)
4. Both ranks blocked during MPI_Send/Recv

// Blocking send/recv using host buffers

// Sender (rank 0)
cudaMemcpy(h_send, d_send, n, cudaMemcpyDeviceToHost);
MPI_Send(h_send, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);

// Receiver (rank 1)
MPI_Recv(h_recv, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
cudaMemcpy(d_recv, h_recv, n, cudaMemcpyHostToDevice);

In blocking MPI communication, routines like MPI_Send and MPI_Recv do not return control to the program until the operation reaches a completion state defined by the MPI standard. For MPI_Send, this typically means that the message data has been copied out of the application buffer—either delivered to the receiver or stored in an internal MPI buffer—making it safe for the sender to reuse or modify the buffer. For MPI_Recv, completion means that the incoming message has been fully transferred into the receive buffer. Blocking communication therefore imposes implicit synchronization between sender and receiver, and both processes may stall until the corresponding operation is matched and the transfer has progressed far enough. This simplicity is attractive for correctness, but it often limits opportunities for overlapping computation and communication.

2. RDMA With Blocking MPI (CUDA-Aware MPI)

Data path:
 GPU0 --PCIe--> NIC0 --InfiniBand--> NIC1 --PCIe--> GPU1

Rank 0 (sender)                      Rank 1 (receiver)
-----------------                    -----------------
MPI_Send(d_send)                     MPI_Recv(d_recv)
    |                                     |
    |---- RDMA transfer in HW ----------->|
    |                                     |
<--- returns when send complete      <---- returns when data arrived

// Blocking send/recv using device pointers (GPUDirect RDMA-enabled MPI)

// Sender (rank 0)
MPI_Send(d_send, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);

// Receiver (rank 1)
MPI_Recv(d_recv, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

GPUDirect RDMA enhances MPI communication efficiency by enabling network adapters to read from and write to GPU memory directly over PCIe, bypassing the CPU and avoiding host-staging copies. However, when MPI operations are issued via blocking calls such as MPI_Send and MPI_Recv, the application still waits for the communication operation to reach the MPI-defined completion condition—even though the data movement itself is offloaded to hardware. The transfer is faster, but the blocking semantics do not change: MPI calls will only return once the NIC has initiated or completed the required RDMA operations. Thus, GPUDirect RDMA accelerates data motion by eliminating redundant copies, but it does not by itself provide computation–communication overlap or eliminate synchronization; those properties depend on the choice of MPI communication mode.
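
Whether those MPI_Send/MPI_Recv calls may legally take device pointers depends on how the MPI library was built. With Open MPI, for instance, CUDA-awareness can be probed at compile time and at run time through mpi-ext.h; the snippet below is a sketch of that check (other MPI implementations expose different mechanisms).

#include <stdio.h>
#include <mpi.h>
#if defined(OPEN_MPI) && OPEN_MPI
#include <mpi-ext.h>   // Open MPI extensions, including the CUDA-awareness query
#endif

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
    // Built with CUDA support; also check whether it is enabled at run time.
    printf("CUDA-aware MPI: compile time yes, run time %s\n",
           MPIX_Query_cuda_support() ? "yes" : "no");
#elif defined(MPIX_CUDA_AWARE_SUPPORT)
    printf("CUDA-aware MPI: this build has no CUDA support\n");
#else
    printf("CUDA-aware MPI: cannot tell from the headers\n");
#endif

    MPI_Finalize();
    return 0;
}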

3. Asynchronous MPI + RDMA (CUDA-Aware MPI + Overlap)

Time 
Rank 0 (sender)                      Rank 1 (receiver)
-----------------------------        -----------------------------
[ MPI_Isend(d_send) ]                [ MPI_Irecv(d_recv) ]
          |                                      |
          v                                      v
      WORK (GPU/CPU)                         WORK (GPU/CPU)
          |                                      |
     [ MPI_Wait ]                           [ MPI_Wait ]

Background data path (in hardware):
 GPU0 --PCIe--> NIC0 --InfiniBand--> NIC1 --PCIe--> GPU1
        <------------- RDMA transfer ------------->

// Non-blocking transfer using device buffers

MPI_Request reqs[2];

// Rank 0
MPI_Isend(d_send, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD, &reqs[0]);
// do useful work here...
// do_gpu_kernel<<<...>>>( ... );
MPI_Wait(&reqs[0], MPI_STATUS_IGNORE);

// Rank 1
MPI_Irecv(d_recv, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &reqs[1]);
// do useful work here...
MPI_Wait(&reqs[1], MPI_STATUS_IGNORE);

Asynchronous communication in MPI, typically using MPI_Isend and MPI_Irecv, initiates transfers but returns immediately, allowing the application to perform useful work while data is in transit. When combined with CUDA-aware MPI and GPUDirect RDMA, these non-blocking operations enable true overlap of communication and computation: the NIC can directly pull or push data from GPU memory while the GPU continues executing kernels and the CPU continues issuing work. Completion is checked later using MPI_Test or enforced using MPI_Wait. This model is essential for achieving high performance on distributed multi-GPU systems, as it hides communication latency behind computation and reduces idle time on both CPUs and GPUs. However, it requires careful dependency management to ensure that data is not overwritten before the transfer completes or used before it has arrived.
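
As a complement to the MPI_Wait calls above, completion can also be polled with MPI_Test so that additional work is queued only until the transfer finishes. A minimal sketch follows; req is the request from an earlier MPI_Isend or MPI_Irecv on a device buffer, and do_independent_work_kernel, d_work, grid, block, and stream are placeholders rather than names from the course code.

int done = 0;
MPI_Test(&req, &done, MPI_STATUS_IGNORE);              // returns immediately
while (!done) {
    // Enqueue a bounded chunk of work that does not touch the in-flight buffer ...
    do_independent_work_kernel<<<grid, block, 0, stream>>>(d_work, n);
    cudaStreamSynchronize(stream);
    // ... then check the request again.
    MPI_Test(&req, &done, MPI_STATUS_IGNORE);
}
// The request has completed; the communication buffer may now be reused.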

Jacobi Solver for the 2D Laplace Equation

Introduction

We consider the numerical solution of the two-dimensional Laplace equation

\[\Delta u(x,y) = 0, \qquad (x,y) \in \Omega \setminus \partial\Omega,\]

on a rectangular domain \(\Omega\). The left and right boundaries impose Dirichlet conditions,

\[u = g(x,y) \quad \text{on } \partial\Omega_{\mathrm{L/R}},\]

while the top and bottom boundaries impose Neumann (reflecting) conditions,

\[\frac{\partial u}{\partial n} = 0 \quad \text{on } \partial\Omega_{\mathrm{T/B}}.\]

This problem describes a steady-state harmonic field with both prescribed and insulated boundaries. It is a fundamental model for elliptic PDEs and is widely used in numerical analysis, scientific computing, and parallel algorithm design.

// =========================================================
//  Five-point stencil approximation (2D Laplace / Jacobi)
// =========================================================
//
//    Each interior grid point u(i,j) is updated using its
//    four direct neighbours (up, down, left, right):
//
//           j+1
//            ^
//            |
//            |
//        +-------+
//        |       |
//        |  u(i,j+1)   (north / top neighbor)
//        |       |
//+-------+-------+-------+--> i
//|       |       |       |
//| u(i-1,j)  u(i,j)  u(i+1,j)
//|  (west)   (center)   (east)
//+-------+-------+-------+
//        |       |
//        |  u(i,j-1)   (south / bottom neighbor)
//        |       |
//        +-------+
//
//  Discrete Laplacian (uniform spacing h):
//
//      ( -u(i-1,j) - u(i+1,j)
//        -u(i,j-1) - u(i,j+1)
//        +4*u(i,j) ) / h^2  ≈  0
//
//  In the Jacobi iteration, the new value at (i,j) is
//
//      u_new(i,j) = 0.25 * [ u_old(i-1,j) + u_old(i+1,j)
//                          + u_old(i,j-1) + u_old(i,j+1) ];
//
//  which is exactly what the CUDA kernel computes using
//  the 1D index mapping iy, ix and the row length nx.
// =========================================================

The domain is discretized on a uniform \(n_x \times n_y\) Cartesian grid with spacing \(h\), using the standard five-point stencil approximation of the Laplacian. The second derivatives at grid point \((x_i, y_j)\) are approximated using central differences:

\[\frac{\partial^2 u}{\partial x^2}\Big|_{(x_i, y_j)} \approx \frac{u_{i-1,j} - 2u_{i,j} + u_{i+1,j}}{h^2} \quad \text{and} \quad \frac{\partial^2 u}{\partial y^2}\Big|_{(x_i, y_j)} \approx \frac{u_{i,j-1} - 2u_{i,j} + u_{i,j+1}}{h^2}\]

The Laplacian is the sum:

\[\begin{align*} \Delta u(x_i, y_j) &= \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} \\ &\approx \frac{u_{i-1,j} - 2u_{i,j} + u_{i+1,j}}{h^2} + \frac{u_{i,j-1} - 2u_{i,j} + u_{i,j+1}}{h^2} \\ &= \frac{1}{h^2} \left[ u_{i-1,j} + u_{i+1,j} + u_{i,j-1} + u_{i,j+1} - 4u_{i,j} \right] \end{align*}\]

This yields the discrete Laplace equation:

\[\text{(Discrete Laplace Equation)} \qquad -4 u_{i,j} + u_{i-1,j} + u_{i+1,j} + u_{i,j-1} + u_{i,j+1} = 0, \quad (i,j) \text{ interior}.\]

To solve the discretized system, we use the Jacobi iteration. At iteration \(t\), each interior grid point is updated via:

\[\text{(Jacobi Update)} \qquad u^{(t+1)}_{i,j} = \frac{1}{4} \left( u^{(t)}_{i-1,j} + u^{(t)}_{i+1,j} + u^{(t)}_{i,j-1} + u^{(t)}_{i,j+1} \right).\]

Because all updates at iteration \(t+1\) depend exclusively on values from iteration \(t\), the method is synchronous and conceptually simple. A common stopping criterion is:

\[\| u^{(t+1)} - u^{(t)} \|_{\infty} < \varepsilon.\]

Here is a diagrammatic grid representing \(i\) and \(j\) indices where \(i = 1,\ldots,n_x\) and \(j = 1,\ldots,n_y\):

\[\begin{array}{c|cccccc} & i=1 & i=2 & i=3 & \cdots & i=n_x \\ \hline j=n_y & (1, n_y) & (2, n_y) & (3, n_y) & \cdots & (n_x, n_y) \\ \vdots &\vdots &\vdots &\vdots & &\vdots \\ j=3 & (1,3) & (2,3) & (3,3) & \cdots & (n_x,3) \\ j=2 & (1,2) & (2,2) & (3,2) & \cdots & (n_x,2) \\ j=1 & (1,1) & (2,1) & (3,1) & \cdots & (n_x,1) \\ \end{array}\]

We convert the 2D unknowns \(u_{i,j}\) into a single long vector by scanning left-to-right in \(i\) and bottom-to-top in \(j\):

\[u = \begin{bmatrix} u_{1,1} & u_{2,1} & \cdots & u_{n_x,1} & u_{1,2} & u_{2,2} & \cdots & u_{n_x,2} & \cdots & u_{1,n_y} & \cdots & u_{n_x,n_y} \end{bmatrix}^T.\]

With the 1D index \(k = i + (j-1)n_x\), we have:

\[- 4u_{k} + u_{k-1} + u_{k+1} + u_{k-n_x} + u_{k+n_x} = 0.\]

The corresponding Jacobi update in 1D-indexed notation is:

\[u_k^{(t+1)} = \frac{1}{4} \left( u_{k-1}^{(t)} + u_{k+1}^{(t)} + u_{k-n_x}^{(t)} + u_{k+n_x}^{(t)} \right)\]

where each update \(u_k^{(t+1)}\) uses only the neighboring values from the previous iteration \(t\).

int iy = blockIdx.y * blockDim.y + threadIdx.y + 1;  // skip bottom boundary row
int ix = blockIdx.x * blockDim.x + threadIdx.x + 1;  // skip left boundary column

if (iy < ny - 1 && ix < nx - 1) {
    const double new_val =
        0.25 * (a[iy * nx + ix + 1] +
                a[iy * nx + ix - 1] +
                a[(iy + 1) * nx + ix] +
                a[(iy - 1) * nx + ix]);

    a_new[iy * nx + ix] = new_val;
}

Stacking all interior equations yields the matrix system

\[\text{(Matrix Representation of Discrete Laplace Equation)} \quad A u = b,\]

where \(A\) is an \(N \times N\) sparse matrix with \(N = n_x n_y\). The matrix \(A\) can be written compactly as the Kronecker sum

\[A = I_{n_y} \otimes T_x \;+\; T_y \otimes I_{n_x},\]

where \(T_x\) and \(T_y\) are the one–dimensional discrete Laplacian matrices,

\[T_x = T_y = \begin{bmatrix} 2 & -1 & 0 & \cdots & 0 \\ -1 & 2 & -1 & \cdots & 0 \\ 0 & -1 & 2 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & -1 \\ 0 & 0 & 0 & -1 & 2 \end{bmatrix}.\]
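
For example, with a \(2 \times 2\) interior grid (\(n_x = n_y = 2\)) the Kronecker sum expands to the familiar \(4 \times 4\) Laplacian matrix:

\[A = I_2 \otimes \begin{bmatrix} 2 & -1 \\ -1 & 2 \end{bmatrix} + \begin{bmatrix} 2 & -1 \\ -1 & 2 \end{bmatrix} \otimes I_2 = \begin{bmatrix} 4 & -1 & -1 & 0 \\ -1 & 4 & 0 & -1 \\ -1 & 0 & 4 & -1 \\ 0 & -1 & -1 & 4 \end{bmatrix}.\]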

The Jacobi method is inherently parallel because each update requires only values from the previous iteration. This makes it a canonical benchmark for parallel stencil computations.

The domain can be decomposed into \(P\) subdomains:

\[\Omega = \Omega^{(1)} \cup \cdots \cup \Omega^{(P)},\]

with each processor handling one subdomain. Processors update their subdomains in parallel.

The five-point stencil references nearest neighbors. Subdomain boundaries require halo exchange:

For the top and bottom boundaries, communication is straightforward because rows are stored contiguously in memory in a row‑major array. Each process has several interior rows plus an extra top and bottom ghost row. In the first MPI_Sendrecv call, the process sends its last interior row (the row just above the bottom ghost row) to its bottom neighbour and simultaneously receives a row from its top neighbour into its top ghost row. In the second MPI_Sendrecv call, it sends its first interior row (just below the top ghost row) to its top neighbour and receives a row from its bottom neighbour into its bottom ghost row. The different tags distinguish the “upward” and “downward” messages. Using MPI_Sendrecv lets the code send and receive in one call and avoids deadlocks.

// =========================================================
//  Domain layout on each MPI rank (top/bottom decomposition)
// =========================================================
//
//          ^  t_nb  (top neighbor)
//          |
//   +----------------------------+
//   |    top ghost row           |  <-- receives from t_nb
//   +----------------------------+
//   |    first interior row      |  <-- send to t_nb
//   |    ...                     |
//   |    ... interior rows ...   |
//   |    ...                     |
//   |    last interior row       |  <-- send to b_nb
//   +----------------------------+
//   |    bottom ghost row        |  <-- receives from b_nb
//   +----------------------------+
//          |
//          v  b_nb  (bottom neighbor)
//
//  Each row has (m-2) interior elements; left/right boundaries are excluded.
// =========================================================

// Send last interior row to bottom neighbor, receive from top neighbor into top ghost row
MPI_Sendrecv(
    a_new_d + offset_last_row,     m-2, MPI_DOUBLE,  // send buffer to bottom neighbor
    b_nb, 1,
    a_new_d + offset_top_boundary, m-2, MPI_DOUBLE,  // receive buffer from top neighbor
    t_nb, 1,
    MPI_COMM_WORLD,
    MPI_STATUS_IGNORE
);

// =========================================================
//  After the first exchange:
//    - bottom neighbor now has our last interior row
//    - our top ghost row now contains data from t_nb
// =========================================================

// Send first interior row to top neighbor, receive from bottom neighbor into bottom ghost row
MPI_Sendrecv(
    a_new_d + offset_first_row,       m-2, MPI_DOUBLE,  // send buffer to top neighbor
    t_nb, 0,
    a_new_d + offset_bottom_boundary, m-2, MPI_DOUBLE,  // receive buffer from bottom neighbor
    b_nb, 0,
    MPI_COMM_WORLD,
    MPI_STATUS_IGNORE
);

// =========================================================
//  After the second exchange:
//    - top neighbor now has our first interior row
//    - our bottom ghost row now contains data from b_nb
// =========================================================

The left and right boundaries are trickier because columns in a row-major array are not stored contiguously (see Simon Garcia de Gonzalo and Lena Oden, Introduction to MPI / Distributed Computing with GPUs). You cannot just send a pointer to a whole column and expect MPI to see a single block of memory. To fix this, the code first launches a CUDA kernel called pack. This kernel walks over all interior rows and copies the left interior column values into a separate contiguous buffer to_left_d on the device. After this packing step completes, the MPI_Sendrecv call uses to_left_d as a normal 1D array: it sends this packed column to the left neighbour and receives a column from the right neighbour into another contiguous buffer from_right_d. Finally, another CUDA kernel called unpack takes the received buffer and scatters its entries back into the appropriate locations of the right ghost column in the 2D array on the GPU. A symmetric sequence (not shown here) does the same in the opposite direction, sending data to the right neighbour and receiving from the left.

// =========================================================
//  Domain layout on each MPI rank (left/right decomposition)
// =========================================================
//
//  l_nb (left neighbor)                        r_nb (right neighbor)
//        |                                           ^
//        |                                           |
//        v                                           |
//
//    row i (schematic, row-major storage)
//
//        +----+----------- interior -----------+----+
//        | LG |   C1   C2   ...   Ck   ...   CL | RG |
//        +----+----------- ----------- --------+----+
//          ^                               ^     ^
//          |                               |     |
//          |                               |     +-- right ghost column
//          |                               +-------- last interior column
//          +--------------------------------------- left interior column
//
//  For the left/right halo exchange shown below:
//
//    - We SEND our left interior column   (C1)      → l_nb
//    - We RECV right neighbor's interior column     → into our right ghost column (RG)
//
//  Since the grid is stored row-major, a column is not contiguous, so we:
//    1) PACK that column into a contiguous buffer  to_left_d   (CUDA kernel)
//    2) MPI_Sendrecv( to_left_d  ↔  from_right_d )
//    3) UNPACK from_right_d into the right ghost column RG     (CUDA kernel)
//
//  Vertically there are (n-2) interior rows; top/bottom physical boundaries are excluded.
// =========================================================

// right neighbor omitted
pack<<<gs, bs, 0, s>>>(to_left_d, u_new_d, n, m);   // 1) pack left interior column → to_left_d
cudaStreamSynchronize(s);

MPI_Sendrecv(
    to_left_d,     n-2, MPI_DOUBLE, l_nb, 0,        // send packed left column to left neighbor
    from_right_d,  n-2, MPI_DOUBLE, r_nb, 0,        // receive packed right column from right neighbor
    MPI_COMM_WORLD,
    MPI_STATUS_IGNORE
);

unpack<<<gs, bs, 0, s>>>(u_new_d, from_right_d, n, m); // 3) unpack into right ghost column
cudaStreamSynchronize(s);

// =========================================================
//  After this exchange:
//    - left neighbor now has our left interior column
//    - our right ghost column now contains data from r_nb
// =========================================================
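
The pack and unpack kernels themselves are not shown in the material; the sketch below is one possible implementation, assuming an n × m row-major local array with one ghost layer on each side, so that column 1 is the left interior column and column m-1 is the right ghost column (argument order matches the launches above).

__global__ void pack(double* __restrict__ to_left, const double* __restrict__ u,
                     int n, int m) {
    int row = blockIdx.x * blockDim.x + threadIdx.x + 1;   // interior rows 1 .. n-2
    if (row < n - 1)
        to_left[row - 1] = u[row * m + 1];                 // gather left interior column
}

__global__ void unpack(double* __restrict__ u, const double* __restrict__ from_right,
                       int n, int m) {
    int row = blockIdx.x * blockDim.x + threadIdx.x + 1;   // interior rows 1 .. n-2
    if (row < n - 1)
        u[row * m + (m - 1)] = from_right[row - 1];        // scatter into right ghost column
}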

Three async schemes

1. Simple async example (non-blocking MPI + CUDA kernel overlap)

In the simple asynchronous example, each rank uses MPI_Isend and MPI_Irecv with device pointers so that communication can progress while the GPU runs kernels. Rank 0 fills a buffer on the GPU, initiates an MPI_Isend from that device buffer, and then launches another kernel to perform computation while the message is in flight. Rank 1 posts an MPI_Irecv into its GPU buffer and similarly performs independent work before calling MPI_Wait. This pattern exploits the fact that non-blocking MPI returns immediately, so the CPU is free to schedule more kernels and perform additional work while the NIC and RDMA engine handle the data transfer in the background. It’s the first step toward hiding communication latency behind useful computation.

Nodes: Rank 0 (sender)                        Rank 1 (receiver)
       GPU0 + CPU0                            GPU1 + CPU1

Time ───────────────────────────────────────────────────────────▶

Rank 0 (CPU):
    MPI_Isend(d_send)
        |
        v
    launch kernel A on GPU0 (work during transfer)
        |
        v
    MPI_Wait (ensure send complete before reusing d_send)

Rank 1 (CPU):
    MPI_Irecv(d_recv)
        |
        v
    launch kernel B on GPU1 (independent work)
        |
        v
    MPI_Wait (ensure receive complete before using d_recv)

Background data movement (hardware / RDMA path):
    d_send (GPU0) --PCIe--> NIC0 --Network--> NIC1 --PCIe--> d_recv (GPU1)
                     <---------- in-flight while kernels run ---------->


// Assume: MPI_Init already called, CUDA initialized, etc.

// Simple kernel
__global__ void scale_kernel(float *x, int n, float alpha) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= alpha;
}

int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

const int N = 1 << 20; // 1M floats
float *d_buf, *d_work;
cudaMalloc(&d_buf,  N * sizeof(float));
cudaMalloc(&d_work, N * sizeof(float)); // separate buffer for overlap work

MPI_Request req;
int tag = 0;

if (rank == 0) {
    // Fill buffer on GPU
    scale_kernel<<<(N+255)/256, 256>>>(d_buf, N, 1.0f);
    cudaDeviceSynchronize();  // ensure data ready before sending

    // Start async send from GPU memory
    MPI_Isend(d_buf, N, MPI_FLOAT, 1, tag, MPI_COMM_WORLD, &req);

    // Overlap: do independent GPU work while data is in flight
    // (d_buf must not be modified until MPI_Wait confirms the send completed)
    scale_kernel<<<(N+255)/256, 256>>>(d_work, N, 2.0f);

    // Ensure send is done before reusing d_buf (or exiting)
    MPI_Wait(&req, MPI_STATUS_IGNORE);

} else if (rank == 1) {
    // Post async receive into GPU memory
    MPI_Irecv(d_buf, N, MPI_FLOAT, 0, tag, MPI_COMM_WORLD, &req);

    // Overlap: do some other GPU work while waiting
    // (d_buf must not be touched until MPI_Wait confirms the data arrived)
    scale_kernel<<<(N+255)/256, 256>>>(d_work, N, 0.0f);

    // Wait until data has arrived
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    // Now data from rank 0 is in d_buf; process it
    scale_kernel<<<(N+255)/256, 256>>>(d_buf, N, 0.5f);
    cudaDeviceSynchronize();
}

cudaFree(d_work);
cudaFree(d_buf);

2. More realistic version (CUDA streams + events + MPI)

In the more realistic implementation, you explicitly separate compute and communication using multiple CUDA streams and CUDA events. One stream (e.g., stream_comm) is responsible for producing data that will be sent, while another stream (stream_compute) performs unrelated computation. A CUDA event is recorded on stream_comm once the send buffer is ready; the CPU then waits on this event before calling MPI_Isend to ensure MPI doesn’t read incomplete GPU data. After initiating the non-blocking send, the CPU can continue to enqueue more work on stream_compute. On the receiving side, MPI_Irecv is posted early, and the GPU can perform work in its own streams while waiting for the MPI transfer to complete. This design gives fine-grained control over dependencies and allows careful overlapping of communication with multiple streams of GPU computation.

Rank 0 (sender) view
-----------------------------------------
GPU0:
  [stream_comm]:  kernel_produce_data ---> event "ready_for_send"
                          |
                          v
  [stream_compute]:  other_kernels (independent work)

CPU0 timeline:
    launch kernel_produce_data on stream_comm
    record event "ready_for_send" on stream_comm
    launch other_kernels on stream_compute (in parallel)
    cudaEventSynchronize("ready_for_send")  // wait only for send buffer
    MPI_Isend(d_buf, ...)                   // start async send
    launch more work on stream_compute
    MPI_Wait(...)                           // later, ensure send complete

Data path:
  d_buf (GPU0) --PCIe--> NIC0 --Network--> NIC1 --PCIe--> d_buf (GPU1)

Rank 1 (receiver) view
-----------------------------------------
CPU1:
    MPI_Irecv(d_buf, ...)        // into GPU1 memory
    launch other_kernels on stream_compute (GPU1)
    MPI_Wait(...)                // later, ensure data arrival

GPU1:
  [stream_compute]: independent kernels
  [stream_comm]:   kernels that consume received d_buf after MPI_Wait

#include <mpi.h>
#include <cuda_runtime.h>

__global__ void compute_kernel(float *x, int n, float alpha) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = alpha * i;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 1 << 20;
    float *d_buf, *d_work;
    cudaMalloc(&d_buf,  N * sizeof(float));
    cudaMalloc(&d_work, N * sizeof(float)); // separate buffer for unrelated compute

    cudaStream_t stream_comm, stream_compute;
    cudaStreamCreate(&stream_comm);
    cudaStreamCreate(&stream_compute);

    cudaEvent_t ready_for_send;
    cudaEventCreate(&ready_for_send);

    int tag = 0;
    MPI_Request req;

    if (rank == 0) {
        // 1) Produce data in stream_comm
        compute_kernel<<<(N+255)/256, 256, 0, stream_comm>>>(d_buf, N, 1.0f);

        // 2) Record event when data is ready
        cudaEventRecord(ready_for_send, stream_comm);

        // 3) Meanwhile, do unrelated compute on another stream
        //    (on a separate buffer, so the pending send data in d_buf stays intact)
        compute_kernel<<<(N+255)/256, 256, 0, stream_compute>>>(d_work, N, 2.0f);

        // 4) Ensure send buffer is ready for MPI (sync to event, not whole device)
        cudaEventSynchronize(ready_for_send);

        // 5) Issue non-blocking send from device buffer
        MPI_Isend(d_buf, N, MPI_FLOAT, 1, tag, MPI_COMM_WORLD, &req);

        // 6) Do more compute while MPI is sending (still not touching d_buf)
        compute_kernel<<<(N+255)/256, 256, 0, stream_compute>>>(d_work, N, 3.0f);

        // 7) Only wait when we actually need to know send is complete
        MPI_Wait(&req, MPI_STATUS_IGNORE);

    } else if (rank == 1) {
        // 1) Post receive into device buffer
        MPI_Irecv(d_buf, N, MPI_FLOAT, 0, tag, MPI_COMM_WORLD, &req);

        // 2) Do other GPU work while waiting (d_buf must stay untouched until MPI_Wait)
        compute_kernel<<<(N+255)/256, 256, 0, stream_compute>>>(d_work, N, 0.0f);

        // 3) Wait until data has arrived
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        // 4) Now operate on the received data
        compute_kernel<<<(N+255)/256, 256, 0, stream_comm>>>(d_buf, N, 0.5f);
        cudaStreamSynchronize(stream_comm);
    }

    cudaEventDestroy(ready_for_send);
    cudaStreamDestroy(stream_comm);
    cudaStreamDestroy(stream_compute);
    cudaFree(d_work);
    cudaFree(d_buf);

    MPI_Finalize();
    return 0;
}

3. Tiny 1D halo-exchange example (CUDA-aware MPI)

In the 1D halo-exchange implementation, the global domain is decomposed into contiguous subdomains, one per MPI rank, and each rank stores its local data plus two halo cells on the GPU: one on the left and one on the right. At each iteration, the rank exchanges boundary values with its neighbor ranks using non-blocking MPI_Isend/MPI_Irecv operations directly on device pointers. While halo data is in flight, the interior points (which don’t depend on new halo data) are updated by a stencil kernel. After MPI_Waitall confirms that halo exchanges are complete, the code can proceed to the next iteration or update boundary-adjacent points if needed. This pattern is the core of many distributed stencil-based solvers (e.g., Jacobi, diffusion, wave equations) and demonstrates how CUDA-aware MPI and asynchronous communication support scalable multi-GPU domain decomposition.

Per-rank data layout on GPU (rank r):
    Index:    0        1         ...      N_local      N_local+1
              |        |                    |             |
           halo_L   interior            interior        halo_R

Neighbors:
    left  neighbor: rank r-1  (or MPI_PROC_NULL at boundary)
    right neighbor: rank r+1

Communication steps per iteration (rank r):

CPU_r:
    // Post receives for halos (device memory)
    MPI_Irecv(&d_u[0],         1, MPI_DOUBLE, left,  tagR, COMM);   // left halo
    MPI_Irecv(&d_u[N_local+1], 1, MPI_DOUBLE, right, tagL, COMM);   // right halo

    // Post sends of boundary interior points
    MPI_Isend(&d_u[1],         1, MPI_DOUBLE, left,  tagL, COMM);   // send left boundary
    MPI_Isend(&d_u[N_local],   1, MPI_DOUBLE, right, tagR, COMM);   // send right boundary

GPU_r:
    // Overlap: update interior points that don't depend on new halos
    kernel_update_interior<<<blocks, threads>>>(d_u[2 .. N_local-1]);

CPU_r:
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);  // ensure halo exchange done
    // Now halos at d_u[0] and d_u[N_local+1] are up to date

Data movement between ranks:
    Rank r          <----->       Rank r+1
    d_u[N_local]                 d_u[1]
      |                            ^
      | (send right / recv left)   |
      +---------- Network ---------+


#include <mpi.h>
#include <cuda_runtime.h>

__global__ void update_interior(double *u, int N_local) {
    // Update only points 2 .. N_local-1: they neither read the halo cells
    // (u[0], u[N_local+1]) nor modify the boundary values that are in flight.
    int i = blockIdx.x * blockDim.x + threadIdx.x + 2; // skip left halo and left boundary point
    if (i <= N_local - 1) {
        // simple in-place stencil using neighbors (a production Jacobi sweep would
        // write into a second array, as the 2D solver below does)
        u[i] = 0.5 * (u[i-1] + u[i+1]);
    }
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int N_local = 1024;   // points per rank (excluding halos)
    const int N_tot   = N_local + 2; // +2 for left/right halos

    double *d_u;
    cudaMalloc(&d_u, N_tot * sizeof(double));

    int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    MPI_Request reqs[4];

    for (int iter = 0; iter < 1000; ++iter) {
        // Post non-blocking receives for halos
        MPI_Irecv(d_u + 0,         1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]); // left halo
        MPI_Irecv(d_u + N_local+1, 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &reqs[1]); // right halo

        // Post non-blocking sends of boundary interior points
        MPI_Isend(d_u + 1,         1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &reqs[2]); // left boundary
        MPI_Isend(d_u + N_local,   1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[3]); // right boundary

        // Overlap: update interior while halos are in flight
        int threads = 256;
        int blocks  = (N_local + threads - 1) / threads;
        update_interior<<<blocks, threads>>>(d_u, N_local);

        // Wait for all halo comms to complete before the halo cells are used
        MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);

        // The boundary-adjacent points (i = 1 and i = N_local) could now be updated
        // with the fresh halo values; sync so the next iteration starts from a
        // consistent state.
        cudaDeviceSynchronize();
    }

    cudaFree(d_u);
    MPI_Finalize();
    return 0;
}

The task

The purpose of this task is to use CUDA-aware MPI to parallelize a Jacobi solver. The starting point is a skeleton jacobi.cu in which the CUDA kernel is already defined and some basic setup functions are present. There is also a single-GPU version for performance and numerical comparison. Take some time to get familiar with the code; some parts, such as the NVTX instrumentation behind the PUSH_RANGE and POP_RANGE macros, will be explained in later sessions and can be ignored for now. Once you are familiar, please work on the TODOs in jacobi.cu:

  • Get the available GPU devices and use them and the local rank to set the active GPU for each process.
  • Compute the top and bottom neighbors. We are using reflecting/periodic boundaries on top and bottom, so rank 0’s top neighbor is (size-1) and rank (size-1)’s bottom neighbor is rank 0.
  • Use MPI_Sendrecv to exchange data between the neighbors (a sketch of the completed exchange follows this list).
  • Use CUDA-aware MPI, so the send and receive buffers are located in GPU memory.
  • The first newly calculated row (iy_start) is sent to the top neighbor and the bottom boundary row (iy_end) is received from the bottom process.
  • The last calculated row (iy_end-1) is sent to the bottom process and the top boundary (0) is received from the top.
  • Don’t forget to synchronize the computation on the GPU before starting the data transfer.
  • Use the self-defined MPI_REAL_TYPE. This allows an easy switch between single- and double precision.
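
Putting the neighbor computation and the two exchanges together, a minimal sketch of the completed TODOs could look like the following. It reuses the skeleton's names (a_new, iy_start, iy_end, nx, compute_stream, MPI_REAL_TYPE, CUDA_RT_CALL, MPI_CALL); the neighbor variables top and bottom are chosen here for illustration and may be named differently in the actual solution.

// Top/bottom neighbors with wrap-around across ranks
const int top    = rank > 0 ? rank - 1 : size - 1;
const int bottom = rank < size - 1 ? rank + 1 : 0;

// Inside the iteration loop, after launching the Jacobi kernel:
// make sure the computation of a_new has finished before MPI touches it.
CUDA_RT_CALL(cudaStreamSynchronize(compute_stream));

// Send first computed row (iy_start) up, receive the bottom halo row (iy_end) from below.
MPI_CALL(MPI_Sendrecv(a_new + iy_start * nx, nx, MPI_REAL_TYPE, top, 0,
                      a_new + iy_end * nx,   nx, MPI_REAL_TYPE, bottom, 0,
                      MPI_COMM_WORLD, MPI_STATUS_IGNORE));

// Send last computed row (iy_end-1) down, receive the top halo row (0) from above.
MPI_CALL(MPI_Sendrecv(a_new + (iy_end - 1) * nx, nx, MPI_REAL_TYPE, bottom, 0,
                      a_new,                     nx, MPI_REAL_TYPE, top, 0,
                      MPI_COMM_WORLD, MPI_STATUS_IGNORE));

The skeleton itself: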
/* Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *  * Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 *  * Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 *  * Neither the name of NVIDIA CORPORATION nor the names of its
 *    contributors may be used to endorse or promote products derived
 *    from this software without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
 * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
 * PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
 * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
 * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
 * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
 * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
 * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
 * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 */
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <iostream>
#include <sstream>
#include <mpi.h>

#ifdef HAVE_CUB
#include <cub/block/block_reduce.cuh>
#endif  // HAVE_CUB

#define MPI_CALL(call)                                                                \
    {                                                                                 \
        int mpi_status = call;                                                        \
        if (0 != mpi_status) {                                                        \
            char mpi_error_string[MPI_MAX_ERROR_STRING];                              \
            int mpi_error_string_length = 0;                                          \
            MPI_Error_string(mpi_status, mpi_error_string, &mpi_error_string_length); \
            if (NULL != mpi_error_string)                                             \
                fprintf(stderr,                                                       \
                        "ERROR: MPI call \"%s\" in line %d of file %s failed "        \
                        "with %s "                                                    \
                        "(%d).\n",                                                    \
                        #call, __LINE__, __FILE__, mpi_error_string, mpi_status);     \
            else                                                                      \
                fprintf(stderr,                                                       \
                        "ERROR: MPI call \"%s\" in line %d of file %s failed "        \
                        "with %d.\n",                                                 \
                        #call, __LINE__, __FILE__, mpi_status);                       \
        }                                                                             \
    }

#include <cuda_runtime.h>

#ifdef USE_NVTX
#include <nvToolsExt.h>

const uint32_t colors[] = {0x0000ff00, 0x000000ff, 0x00ffff00, 0x00ff00ff,
                           0x0000ffff, 0x00ff0000, 0x00ffffff};
const int num_colors = sizeof(colors) / sizeof(uint32_t);

#define PUSH_RANGE(name, cid)                              \
    {                                                      \
        int color_id = cid;                                \
        color_id = color_id % num_colors;                  \
        nvtxEventAttributes_t eventAttrib = {0};           \
        eventAttrib.version = NVTX_VERSION;                \
        eventAttrib.size = NVTX_EVENT_ATTRIB_STRUCT_SIZE;  \
        eventAttrib.colorType = NVTX_COLOR_ARGB;           \
        eventAttrib.color = colors[color_id];              \
        eventAttrib.messageType = NVTX_MESSAGE_TYPE_ASCII; \
        eventAttrib.message.ascii = name;                  \
        nvtxRangePushEx(&eventAttrib);                     \
    }
#define POP_RANGE nvtxRangePop();
#else
#define PUSH_RANGE(name, cid)
#define POP_RANGE
#endif

// #############################################################################
// #                         KERNEL REGION BEGIN                               #
// #############################################################################

#define CUDA_RT_CALL(call)                                                                  \
    {                                                                                       \
        cudaError_t cudaStatus = call;                                                      \
        if (cudaSuccess != cudaStatus)                                                      \
            fprintf(stderr,                                                                 \
                    "ERROR: CUDA RT call \"%s\" in line %d of file %s failed "              \
                    "with "                                                                 \
                    "%s (%d).\n",                                                           \
                    #call, __LINE__, __FILE__, cudaGetErrorString(cudaStatus), cudaStatus); \
    }

#ifdef USE_DOUBLE
typedef double real;
#define MPI_REAL_TYPE MPI_DOUBLE
#else
typedef float real;
#define MPI_REAL_TYPE MPI_FLOAT
#endif

//-----------------------------------------------------------------------------
// initialize_boundaries kernel
//-----------------------------------------------------------------------------
// This kernel sets Dirichlet boundary values on the left and right columns of
// the domain (on device arrays a and a_new), imposing the boundary function:
//
//     u(x=0, y)      = sin(2πy)
//     u(x=L_x, y)    = sin(2πy)
//
// Each CUDA thread sets the boundary values for a given row.
//
// Example row structure (iy = row):
//     x=0                     x=nx-1
//      |                        |
//      v                        v
//   +---*--------------------*---+ iy = 0
//   |   |                    |   |
//   +---*--------------------*---+ iy = my_ny-1
//      ^                    ^
//    left                right
//    boundary            boundary
//
// For each iy (row):
//     - a[iy * nx + 0]           <-- left boundary
//     - a[iy * nx + (nx-1)]      <-- right boundary
//     (and similarly in a_new)

__global__ void initialize_boundaries(real* __restrict__ const a_new, real* __restrict__ const a,
                                      const real pi, const int offset, const int nx,
                                      const int my_ny, const int ny) {
    for (int iy = blockIdx.x * blockDim.x + threadIdx.x; iy < my_ny; iy += blockDim.x * gridDim.x) {
        const real y0 = sin(2.0 * pi * (offset + iy) / (ny - 1));
        // Set left boundary
        a[iy * nx + 0] = y0;
        // Set right boundary
        a[iy * nx + (nx - 1)] = y0;
        a_new[iy * nx + 0] = y0;
        a_new[iy * nx + (nx - 1)] = y0;
    }
}

void launch_initialize_boundaries(real* __restrict__ const a_new, real* __restrict__ const a,
                                  const real pi, const int offset, const int nx, const int my_ny,
                                  const int ny) {
    initialize_boundaries<<<my_ny / 128 + 1, 128>>>(a_new, a, pi, offset, nx, my_ny, ny);
    CUDA_RT_CALL(cudaGetLastError());
}

//-----------------------------------------------------------------------------
// Jacobi Iteration CUDA Kernel (Five-point stencil, 2D grid)
//-----------------------------------------------------------------------------
//
// Each CUDA thread operates on one interior grid point (ix, iy) and updates
// a_new[iy * nx + ix] using the values at the four direct neighbours in 'a'
// from the previous Jacobi sweep.
//
//
//     iy-1
//       ^
//       |
//    +------+------+------+
//    |      |      |      |
//    |      | up   |      |
//    |      |(N)   |      |
//    +------+------+------+
//    |      |      |      |
// ix-1<-- W  | (C)  | E --> ix+1
//    |      |      |      |
//    +------+------+------+
//    |      | down |      |
//    |      |(S)   |      |
//    +------+------+------+
//       |
//     iy+1
//
//   - (C): center, this grid point (ix, iy)
//   - (N): north/up,   (ix, iy+1)
//   - (S): south/down, (ix, iy-1)
//   - (E): east/right, (ix+1, iy)
//   - (W): west/left,  (ix-1, iy)
//
// Jacobi update:
//     a_new[iy * nx + ix] = 0.25 * (
//         a[iy * nx + (ix-1)] +  // left
//         a[iy * nx + (ix+1)] +  // right
//         a[(iy-1) * nx + ix] +  // down
//         a[(iy+1) * nx + ix]    // up
//     );
//
// Optionally, each thread accumulates the local squared residue
// for the L2 norm calculation (for convergence checking).

template <int BLOCK_DIM_X, int BLOCK_DIM_Y>
__global__ void jacobi_kernel(real* __restrict__ const a_new, const real* __restrict__ const a,
                              real* __restrict__ const l2_norm, const int iy_start,
                              const int iy_end, const int nx, const bool calculate_norm) {
#ifdef HAVE_CUB
    typedef cub::BlockReduce<real, BLOCK_DIM_X, cub::BLOCK_REDUCE_WARP_REDUCTIONS, BLOCK_DIM_Y>
        BlockReduce;
    __shared__ typename BlockReduce::TempStorage temp_storage;
#endif  // HAVE_CUB
    int iy = blockIdx.y * blockDim.y + threadIdx.y + iy_start;
    int ix = blockIdx.x * blockDim.x + threadIdx.x + 1;
    real local_l2_norm = 0.0;

    if (iy < iy_end && ix < (nx - 1)) {
        const real new_val = 0.25 * (a[iy * nx + ix + 1] + a[iy * nx + ix - 1] +
                                     a[(iy + 1) * nx + ix] + a[(iy - 1) * nx + ix]);
        a_new[iy * nx + ix] = new_val;
        if (calculate_norm) {
            real residue = new_val - a[iy * nx + ix];
            local_l2_norm += residue * residue;
        }
    }
    if (calculate_norm) {
#ifdef HAVE_CUB
        real block_l2_norm = BlockReduce(temp_storage).Sum(local_l2_norm);
        if (0 == threadIdx.y && 0 == threadIdx.x) atomicAdd(l2_norm, block_l2_norm);
#else
        atomicAdd(l2_norm, local_l2_norm);
#endif  // HAVE_CUB
    }
}

//-----------------------------------------------------------------------------
// Kernel launcher utility: launch_jacobi_kernel()
//-----------------------------------------------------------------------------
// Launches the 2D Jacobi CUDA kernel over a subdomain.
//
//  Each CUDA block covers a rectangular tile of the local subdomain.
//  The grid is sized to cover all (iy_start:iy_end, ix=1:nx-2) interior points.
//
//  Each thread in each block processes one (ix, iy) grid point in the domain,
//  applies the Jacobi stencil, and accumulates local L2 norm, if requested.

void launch_jacobi_kernel(real* __restrict__ const a_new, const real* __restrict__ const a,
                          real* __restrict__ const l2_norm, const int iy_start, const int iy_end,
                          const int nx, const bool calculate_norm, cudaStream_t stream) {
    constexpr int dim_block_x = 32;
    constexpr int dim_block_y = 32;
    dim3 dim_grid((nx + dim_block_x - 1) / dim_block_x,
                  ((iy_end - iy_start) + dim_block_y - 1) / dim_block_y, 1);
    jacobi_kernel<dim_block_x, dim_block_y><<<dim_grid, {dim_block_x, dim_block_y, 1}, 0, stream>>>(
        a_new, a, l2_norm, iy_start, iy_end, nx, calculate_norm);
    CUDA_RT_CALL(cudaGetLastError());
}

// #############################################################################
// #                          KERNEL REGION END                                #
// #############################################################################

constexpr real tol = 1.0e-8;

const real PI = 2.0 * std::asin(1.0);

void launch_initialize_boundaries(real* __restrict__ const a_new, real* __restrict__ const a,
                                  const real pi, const int offset, const int nx, const int my_ny,
                                  const int ny);

void launch_jacobi_kernel(real* __restrict__ const a_new, const real* __restrict__ const a,
                          real* __restrict__ const l2_norm, const int iy_start, const int iy_end,
                          const int nx, const bool calculate_norm, cudaStream_t stream);

double single_gpu(const int nx, const int ny, const int iter_max, real* const a_ref_h,
                  const int nccheck, const bool print);

template <typename T>
T get_argval(char** begin, char** end, const std::string& arg, const T default_val) {
    T argval = default_val;
    char** itr = std::find(begin, end, arg);
    if (itr != end && ++itr != end) {
        std::istringstream inbuf(*itr);
        inbuf >> argval;
    }
    return argval;
}

bool get_arg(char** begin, char** end, const std::string& arg) {
    char** itr = std::find(begin, end, arg);
    if (itr != end) {
        return true;
    }
    return false;
}



int main(int argc, char* argv[]) {

    // =========================================================================
    // MPI Initialization
    // =========================================================================
    int size, rank;
    MPI_CALL(MPI_Init(&argc, &argv));
    MPI_CALL(MPI_Comm_rank(MPI_COMM_WORLD, &rank));
    MPI_CALL(MPI_Comm_size(MPI_COMM_WORLD, &size));

    // =========================================================================
    // Command-line Arguments
    // =========================================================================
    const int iter_max = get_argval<int>(argv, argv + argc, "-niter", 1000);
    const int nccheck  = get_argval<int>(argv, argv + argc, "-nccheck", 1);
    const int nx       = get_argval<int>(argv, argv + argc, "-nx", 16384);
    const int ny       = get_argval<int>(argv, argv + argc, "-ny", 16384);
    const bool csv     = get_arg(argv, argv + argc, "-csv");

    // =========================================================================
    // Local Rank and Device Selection
    // =========================================================================
    // This code gets your local rank on a node 
    int local_rank = -1;
    {
        MPI_Comm local_comm;
        MPI_CALL(MPI_Comm_split_type(
            MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL, &local_comm));
        MPI_CALL(MPI_Comm_rank(local_comm, &local_rank));
        MPI_CALL(MPI_Comm_free(&local_comm));
    }
    int num_devices = 0;
    /*
    TODO: Get the available GPU devices into `num_devices` and use it and the current rank to set the active GPU.
        +-----------------------------+
        |  Multi-GPU Device Selection |
        +-----------------------------+

        This block of code determines which GPU device each MPI rank will use.
        The motivation is to distribute MPI ranks across available GPUs, especially when 
        multiple ranks share a node. The strategy used here is "local rank modulo number of devices".

            Node with 4 ranks and 2 GPUs:

            [Node]
            +-----------------------------------------+
            |   Rank 0   Rank 1   Rank 2   Rank 3     |
            |    |         |         |         |      |
            |    v         v         v         v      |
            |  [GPU 0]   [GPU 1]   [GPU 0]   [GPU 1]  |
            +-----------------------------------------+

        - `cudaGetDeviceCount(&num_devices)` queries the number of GPUs present on the node.
        - `cudaSetDevice(local_rank % num_devices)` assigns a GPU to this rank, cycling if more ranks than GPUs.
            For example, if there are 2 GPUs and 4 ranks, local_rank values 0,1,2,3 get mapped as 0,1,0,1.
        - `cudaFree(0)` is a CUDA idiom that initializes the CUDA context on the selected device, avoiding later startup overhead.

        This ensures all ranks operate on GPUs and multiple ranks on a node are distributed as evenly as possible.
    */
    CUDA_RT_CALL(cudaGetDeviceCount(&num_devices));
    CUDA_RT_CALL(cudaSetDevice(local_rank % num_devices));
    CUDA_RT_CALL(cudaFree(0));

    // =========================================================================
    // Host Memory Allocation and Initial Single GPU Reference
    // =========================================================================
    real* a_ref_h; CUDA_RT_CALL(cudaMallocHost(&a_ref_h, nx * ny * sizeof(real)));
    real* a_h;     CUDA_RT_CALL(cudaMallocHost(&a_h, nx * ny * sizeof(real)));
    double runtime_serial = single_gpu(nx, ny, iter_max, a_ref_h, nccheck, !csv && (0 == rank));

    // =========================================================================
    // Partitioning Problem Across Ranks
    // =========================================================================
    int chunk_size = (ny - 2) / size;
    if (rank == size - 1)
        chunk_size += (ny - 2) % size;

    // =========================================================================
    // Device Memory Allocation
    // =========================================================================
    real* a;     CUDA_RT_CALL(cudaMalloc(&a,     nx * (chunk_size + 2) * sizeof(real)));
    real* a_new; CUDA_RT_CALL(cudaMalloc(&a_new, nx * (chunk_size + 2) * sizeof(real)));
    CUDA_RT_CALL(cudaMemset(a,     0, nx * (chunk_size + 2) * sizeof(real)));
    CUDA_RT_CALL(cudaMemset(a_new, 0, nx * (chunk_size + 2) * sizeof(real)));

    // =========================================================================
    // Index Setup for Local Computation
    // =========================================================================
    int iy_start_global = rank * ((ny - 2) / size) + 1;
    int iy_end_global   = iy_start_global + chunk_size - 1;  // My last index in the global array
    int iy_start        = 1;                                 // My local start index for computation
    int iy_end          = iy_start + chunk_size;             // One past my last computed row (the bottom halo row)

    // =========================================================================
    // Boundary Condition Initialization
    // =========================================================================
    // Set Dirichlet boundary conditions on the left and right border
    launch_initialize_boundaries(
        a, a_new, PI, iy_start_global - 1, nx, (chunk_size + 2), ny
    );
    CUDA_RT_CALL(cudaDeviceSynchronize());

    // =========================================================================
    // CUDA Streams and Events
    // =========================================================================
    cudaStream_t compute_stream;      CUDA_RT_CALL(cudaStreamCreate(&compute_stream));
    cudaEvent_t  compute_done;        CUDA_RT_CALL(cudaEventCreateWithFlags(&compute_done, cudaEventDisableTiming));
    real* l2_norm_d;                  CUDA_RT_CALL(cudaMalloc(&l2_norm_d, sizeof(real)));
    real* l2_norm_h;                  CUDA_RT_CALL(cudaMallocHost(&l2_norm_h, sizeof(real)));

    // =========================================================================
    // Print Description (rank 0)
    // =========================================================================
    if (!csv && 0 == rank) {
        printf(
            "Jacobi relaxation: %d iterations on %d x %d mesh with norm check "
            "every %d iterations\n",
            iter_max, ny, nx, nccheck);
    }

    // =========================================================================
    // Jacobi Iteration Loop
    // =========================================================================
    int iter = 0;
    real l2_norm = 1.0;
    bool calculate_norm; // boolean to store whether l2 norm will be calculated in an iteration or not

    MPI_CALL(MPI_Barrier(MPI_COMM_WORLD));
    double start = MPI_Wtime();
    PUSH_RANGE("Jacobi solve", 0);

    while (l2_norm > tol && iter < iter_max) {
        // -----------------------------------------------------------
        // 1. Zero L2 Norm on Device
        // -----------------------------------------------------------
        CUDA_RT_CALL(cudaMemsetAsync(l2_norm_d, 0, sizeof(real), compute_stream));

        // -----------------------------------------------------------
        // 2. Decide Norm Calculation for this Iteration
        // -----------------------------------------------------------
        calculate_norm = (iter % nccheck) == 0 || (!csv && (iter % 100) == 0);

        // -----------------------------------------------------------
        // 3. Launch Jacobi Kernel
        // -----------------------------------------------------------
        launch_jacobi_kernel(
            a_new, a, l2_norm_d, iy_start, iy_end, nx, calculate_norm, compute_stream);

        CUDA_RT_CALL(cudaEventRecord(compute_done, compute_stream));

        // -----------------------------------------------------------
        // 4. Optionally Copy L2 Norm from Device to Host
        // -----------------------------------------------------------
        if (calculate_norm) {
            CUDA_RT_CALL(cudaMemcpyAsync(
                l2_norm_h, l2_norm_d, sizeof(real), cudaMemcpyDeviceToHost, compute_stream));
        }

        // -----------------------------------------------------------
        // 5. Halo Exchange with Neighbors (MPI)
        // -----------------------------------------------------------
        // TODO: Compute the top and bottom neighbor ranks, using periodic boundaries: rank 0 and rank (size-1)
        // exchange data with each other.
        // 
        // Figure out top and bottom MPI neighbors with periodic boundaries.
        //
        //   +--------+       +--------+                   +---------------+
        //   | Rank 0 | <---> | Rank 1 | <---> ... <--->   | Rank (size-1) |
        //   +--------+       +--------+                   +---------------+
        //       ^                                                 |
        //       |_________________________________________________|
        //
        // Each rank exchanges data with its top and bottom neighbor:
        //   - The "top" of rank 0 is rank (size-1) (wrap around)
        //   - The "bottom" of rank (size-1) is rank 0 (wrap around)
        //   - Otherwise: top = rank - 1, bottom = rank + 1
        //
        // The calculation:
        const int top    = (rank > 0) ? (rank - 1) : (size - 1);
        const int bottom = (rank + 1) % size;

        // TODO: Use MPI_Sendrecv to exchange the data with the top and bottom neighbors 
        // Use CUDA-aware MPI here, i.e. receive the data directly in a_new on the GPU and send it from there 
        // without manually copying the data. 

        // The first newly calculated row (`iy_start`) is sent to the top neighbor, and the bottom halo row
        // (`iy_end`) is received from the bottom process.
        // The last calculated row (`iy_end-1`) is sent to the bottom process, and the top halo row (`0`) is received from the top.
        // Don't forget to synchronize the computation on the GPU before starting the data transfer
        CUDA_RT_CALL(cudaEventSynchronize(compute_done));
        PUSH_RANGE("MPI", 5);

        MPI_CALL(MPI_Sendrecv(
            a_new + iy_start * nx, nx, MPI_REAL_TYPE, top, 0,
            a_new + (iy_end * nx), nx, MPI_REAL_TYPE, bottom, 0, MPI_COMM_WORLD,
            MPI_STATUS_IGNORE));

        MPI_CALL(MPI_Sendrecv(
            a_new + (iy_end - 1) * nx, nx, MPI_REAL_TYPE, bottom, 0,
            a_new, nx, MPI_REAL_TYPE, top, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE));
        POP_RANGE;

        // -----------------------------------------------------------
        // 6. L2 Norm Reduction
        // -----------------------------------------------------------
        if (calculate_norm) {
            CUDA_RT_CALL(cudaStreamSynchronize(compute_stream));
            MPI_CALL(MPI_Allreduce(
                l2_norm_h, &l2_norm, 1, MPI_REAL_TYPE, MPI_SUM, MPI_COMM_WORLD));
            l2_norm = std::sqrt(l2_norm);

            if (!csv && 0 == rank && (iter % 100) == 0) {
                printf("%5d, %0.6f\n", iter, l2_norm);
            }
        }

        // -----------------------------------------------------------
        // 7. Swap Grids for Next Iteration
        // -----------------------------------------------------------
        std::swap(a_new, a);
        iter++;
    }
    double stop = MPI_Wtime();
    POP_RANGE;

    // =========================================================================
    // Copy Results Back and Validate
    // =========================================================================
    CUDA_RT_CALL(cudaMemcpy(
        a_h + iy_start_global * nx, a + nx,
        std::min((ny - iy_start_global) * nx, chunk_size * nx) * sizeof(real),
        cudaMemcpyDeviceToHost));

    int result_correct = 1;

    for (int iy = iy_start_global; result_correct && (iy < iy_end_global); ++iy) {
        for (int ix = 1; result_correct && (ix < (nx - 1)); ++ix) {
            if (std::fabs(a_ref_h[iy * nx + ix] - a_h[iy * nx + ix]) > tol) {
                fprintf(stderr,
                        "ERROR on rank %d: a[%d * %d + %d] = %f does not match %f "
                        "(reference)\n",
                        rank, iy, nx, ix, a_h[iy * nx + ix], a_ref_h[iy * nx + ix]);
                result_correct = 0;
            }
        }
    }

    int global_result_correct = 1;
    MPI_CALL(MPI_Allreduce(
        &result_correct, &global_result_correct, 1, MPI_INT, MPI_MIN, MPI_COMM_WORLD));
    result_correct = global_result_correct;

    // =========================================================================
    // Report Results
    // =========================================================================
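    // Speedup = t_1GPU / t_NGPU and parallel efficiency = speedup / N * 100 %.
    // E.g. for the 4-GPU run shown below: 3.2828 s / 0.9649 s = 3.40x and 3.40 / 4 = 85 %.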
    if (rank == 0 && result_correct) {
        if (csv) {
            printf("mpi, %d, %d, %d, %d, %d, 1, %f, %f\n", nx, ny, iter_max, nccheck, size,
                   (stop - start), runtime_serial);
        } else {
            printf("Num GPUs: %d.\n", size);
            printf(
                "%dx%d: 1 GPU: %8.4f s, %d GPUs: %8.4f s, speedup: %8.2f, "
                "efficiency: %8.2f \n",
                ny, nx, runtime_serial, size, (stop - start),
                runtime_serial / (stop - start),
                runtime_serial / (size * (stop - start)) * 100);
        }
    }

    // =========================================================================
    // Cleanup (CUDA/MPI)
    // =========================================================================
    CUDA_RT_CALL(cudaEventDestroy(compute_done));
    CUDA_RT_CALL(cudaStreamDestroy(compute_stream));
    CUDA_RT_CALL(cudaFreeHost(l2_norm_h));
    CUDA_RT_CALL(cudaFree(l2_norm_d));
    CUDA_RT_CALL(cudaFree(a_new));
    CUDA_RT_CALL(cudaFree(a));
    CUDA_RT_CALL(cudaFreeHost(a_h));
    CUDA_RT_CALL(cudaFreeHost(a_ref_h));

    MPI_CALL(MPI_Finalize());
    return (result_correct == 1) ? 0 : 1;
}


double single_gpu(const int nx, const int ny, const int iter_max, real* const a_ref_h,
                  const int nccheck, const bool print) {
    // ----------------------------------------------------------------------
    // single_gpu(): Solves the Jacobi relaxation problem on a single GPU
    // Organization of this function:
    // 1. Declaration of device/host pointers and CUDA streams/events
    // 2. Allocation and initialization of memory
    // 3. Setting boundary conditions and synchronizing device
    // 4. Creation of CUDA streams and events for overlapping compute/comm
    // 5. (Optional) Print problem information
    // 6. Main Jacobi iteration loop:
    //    - Zero L2 norm buffer asynchronously
    //    - Make computation stream wait for completion of "push" events
    //    - Launch kernel (optionally computing the L2 norm)
    //    - (If needed) Async copy L2 norm back to host
    //    - Apply periodic boundary conditions with device-to-device copies
    //    - Synchronize streams/events, record, update L2 norm, and print every so many iterations
    //    - Swap pointers and increment iteration counter
    // 7. Record timing, copy result to host, and cleanup all resources
    // 8. Return elapsed time
    // ----------------------------------------------------------------------

    // 1. Declaration of pointers, streams, and events
    real* a;
    real* a_new;

    cudaStream_t compute_stream;
    cudaStream_t push_top_stream;
    cudaStream_t push_bottom_stream;
    cudaEvent_t compute_done;
    cudaEvent_t push_top_done;
    cudaEvent_t push_bottom_done;

    real* l2_norm_d;
    real* l2_norm_h;

    int iy_start = 1;
    int iy_end = (ny - 1);

    // 2. Allocation of device memory for working arrays
    CUDA_RT_CALL(cudaMalloc(&a, nx * ny * sizeof(real)));
    CUDA_RT_CALL(cudaMalloc(&a_new, nx * ny * sizeof(real)));

    // Initialize working arrays to zero
    CUDA_RT_CALL(cudaMemset(a, 0, nx * ny * sizeof(real)));
    CUDA_RT_CALL(cudaMemset(a_new, 0, nx * ny * sizeof(real)));

    // 3. Set Dirichlet boundary conditions (left and right boundaries)
    launch_initialize_boundaries(a, a_new, PI, 0, nx, ny, ny);
    CUDA_RT_CALL(cudaDeviceSynchronize());

    // 4. Create CUDA streams and events for compute and data movement overlap
    CUDA_RT_CALL(cudaStreamCreate(&compute_stream));
    CUDA_RT_CALL(cudaStreamCreate(&push_top_stream));
    CUDA_RT_CALL(cudaStreamCreate(&push_bottom_stream));
    CUDA_RT_CALL(cudaEventCreateWithFlags(&compute_done, cudaEventDisableTiming));
    CUDA_RT_CALL(cudaEventCreateWithFlags(&push_top_done, cudaEventDisableTiming));
    CUDA_RT_CALL(cudaEventCreateWithFlags(&push_bottom_done, cudaEventDisableTiming));

    // Auxiliary memory for L2 norm computation
    CUDA_RT_CALL(cudaMalloc(&l2_norm_d, sizeof(real)));
    CUDA_RT_CALL(cudaMallocHost(&l2_norm_h, sizeof(real)));

    CUDA_RT_CALL(cudaDeviceSynchronize());

    // 5. Print problem information if requested
    if (print)
        printf(
            "Single GPU jacobi relaxation: %d iterations on %d x %d mesh with "
            "norm "
            "check every %d iterations\n",
            iter_max, ny, nx, nccheck);

    // 6. Main Jacobi iteration loop
    int iter = 0;
    real l2_norm = 1.0;
    bool calculate_norm;

    double start = MPI_Wtime();
    PUSH_RANGE("Jacobi solve", 0)
    while (l2_norm > tol && iter < iter_max) {
        // Zero the device L2 norm for this iteration
        CUDA_RT_CALL(cudaMemsetAsync(l2_norm_d, 0, sizeof(real), compute_stream));

        // Ensure compute stream waits for previous boundary updates
        CUDA_RT_CALL(cudaStreamWaitEvent(compute_stream, push_top_done, 0));
        CUDA_RT_CALL(cudaStreamWaitEvent(compute_stream, push_bottom_done, 0));

        // Only calculate norm periodically for efficiency
        calculate_norm = (iter % nccheck) == 0 || (iter % 100) == 0;
        launch_jacobi_kernel(a_new, a, l2_norm_d, iy_start, iy_end, nx, calculate_norm,
                             compute_stream);
        CUDA_RT_CALL(cudaEventRecord(compute_done, compute_stream));

        // If requested, asynchronously copy norm result back to host
        if (calculate_norm) {
            CUDA_RT_CALL(cudaMemcpyAsync(l2_norm_h, l2_norm_d, sizeof(real), cudaMemcpyDeviceToHost,
                                         compute_stream));
        }

        // 6a. Apply periodic boundary conditions via device-device copies

        // Update top boundary after compute
        CUDA_RT_CALL(cudaStreamWaitEvent(push_top_stream, compute_done, 0));
        CUDA_RT_CALL(cudaMemcpyAsync(a_new, a_new + (iy_end - 1) * nx, nx * sizeof(real),
                                     cudaMemcpyDeviceToDevice, push_top_stream));
        CUDA_RT_CALL(cudaEventRecord(push_top_done, push_top_stream));

        // Update bottom boundary after compute
        CUDA_RT_CALL(cudaStreamWaitEvent(push_bottom_stream, compute_done, 0));
        CUDA_RT_CALL(cudaMemcpyAsync(a_new + iy_end * nx, a_new + iy_start * nx, nx * sizeof(real),
                                     cudaMemcpyDeviceToDevice, push_bottom_stream));
        CUDA_RT_CALL(cudaEventRecord(push_bottom_done, push_bottom_stream));

        // 6b. Finalize L2 norm calculation
        if (calculate_norm) {
            CUDA_RT_CALL(cudaStreamSynchronize(compute_stream));
            l2_norm = *l2_norm_h;
            l2_norm = std::sqrt(l2_norm);
            if (print && (iter % 100) == 0) printf("%5d, %0.6f\n", iter, l2_norm);
        }

        // Swap roles of input/output arrays for next iteration
        std::swap(a_new, a);
        iter++;
    }
    POP_RANGE
    double stop = MPI_Wtime();

    // 7. Copy the final result back to host and cleanup resources
    CUDA_RT_CALL(cudaMemcpy(a_ref_h, a, nx * ny * sizeof(real), cudaMemcpyDeviceToHost));

    CUDA_RT_CALL(cudaEventDestroy(push_bottom_done));
    CUDA_RT_CALL(cudaEventDestroy(push_top_done));
    CUDA_RT_CALL(cudaEventDestroy(compute_done));
    CUDA_RT_CALL(cudaStreamDestroy(push_bottom_stream));
    CUDA_RT_CALL(cudaStreamDestroy(push_top_stream));
    CUDA_RT_CALL(cudaStreamDestroy(compute_stream));

    CUDA_RT_CALL(cudaFreeHost(l2_norm_h));
    CUDA_RT_CALL(cudaFree(l2_norm_d));

    CUDA_RT_CALL(cudaFree(a_new));
    CUDA_RT_CALL(cudaFree(a));
    // 8. Return elapsed time
    return (stop - start);
}
Ⓒ [wood2@jwlogin24 solutions]$ make 
nvcc -DHAVE_CUB -dc -Xcompiler -fopenmp -lineinfo -DUSE_NVTX -lnvToolsExt \
    -gencode arch=compute_80,code=sm_80 \
    -gencode arch=compute_80,code=compute_80 \
    -std=c++14  \
    -I/p/software/default/stages/2025/software/OpenMPI/5.0.5-GCC-13.3.0/include \
    jacobi.cu -c -o jacobi.o

nvcc \
    -gencode arch=compute_80,code=sm_80 \
    -gencode arch=compute_80,code=compute_80 \
    jacobi.o -o jacobi \
    -ccbin=mpic++ \
    -L/p/software/default/stages/2025/software/NVSHMEM/3.1.7-gompi-2024a-CUDA-12 \
    -L/p/software/default/stages/2025/software/OpenMPI/5.0.5-GCC-13.3.0/lib -lmpi \
    -L/p/software/juwelsbooster/stages/2025/software/CUDA/12/lib64 \
    -lcuda -lcudart -lnvToolsExt
Ⓒ [wood2@jwlogin24 solutions]$ make run
salloc --reservation sc25-multigpu-tut --partition booster --gres=gpu:4 --disable-dcgm --time 0:10:00 srun --cpu-bind=sockets --unbuffered -n 4 ./jacobi
salloc: Pending job allocation 12864084
salloc: job 12864084 queued and waiting for resources
salloc: job 12864084 has been allocated resources
salloc: Granted job allocation 12864084
salloc: Waiting for resource configuration
salloc: Nodes jwb0291 are ready for job
Single GPU jacobi relaxation: 1000 iterations on 16384 x 16384 mesh with norm check every 1 iterations
    0, 31.999022
  100, 0.897983
  200, 0.535684
  300, 0.395651
  400, 0.319039
  500, 0.269961
  600, 0.235509
  700, 0.209829
  800, 0.189854
  900, 0.173818
Jacobi relaxation: 1000 iterations on 16384 x 16384 mesh with norm check every 1 iterations
    0, 31.999023
  100, 0.897983
  200, 0.535685
  300, 0.395651
  400, 0.319039
  500, 0.269961
  600, 0.235510
  700, 0.209829
  800, 0.189854
  900, 0.173818
Num GPUs: 4.
16384x16384: 1 GPU:   3.2828 s, 4 GPUs:   0.9649 s, speedup:     3.40, efficiency:    85.06 
salloc: Relinquishing job allocation 12864084
salloc: Job allocation 12864084 has been revoked.
Ⓒ [wood2@jwlogin24 solutions]$ salloc --reservation sc25-multigpu-tut \
       --partition booster \
       --nodes=4 \
       --gres=gpu:4 \
       --disable-dcgm \
       --time 0:10:00 \
       srun --cpu-bind=sockets --unbuffered -n 16 ./jacobi
salloc: Granted job allocation 12864109
salloc: Waiting for resource configuration
salloc: Nodes jwb[0158,0163,0411-0412] are ready for job
Single GPU jacobi relaxation: 1000 iterations on 16384 x 16384 mesh with norm check every 1 iterations
    0, 31.999022
  100, 0.897983
  200, 0.535684
  300, 0.395651
  400, 0.319039
  500, 0.269961
  600, 0.235509
  700, 0.209829
  800, 0.189854
  900, 0.173818
Jacobi relaxation: 1000 iterations on 16384 x 16384 mesh with norm check every 1 iterations
    0, 31.999025
  100, 0.897983
  200, 0.535685
  300, 0.395651
  400, 0.319039
  500, 0.269961
  600, 0.235510
  700, 0.209829
  800, 0.189854
  900, 0.173818
Num GPUs: 16.
16384x16384: 1 GPU:   3.2875 s, 16 GPUs:   0.3454 s, speedup:     9.52, efficiency:    59.48 
salloc: Relinquishing job allocation 12864109
salloc: Job allocation 12864109 has been revoked.
Ⓒ [wood2@jwlogin24 solutions]$ salloc --reservation sc25-multigpu-tut \
       --partition booster \
       --nodes=4 \
       --gres=gpu:4 \
       --disable-dcgm \
       --time 0:10:00 \
       srun --cpu-bind=sockets --unbuffered -n 16 \
            nsys profile --trace=mpi,cuda,nvtx \
            -o jacobi.03H-sol__20251116-2050.%q{SLURM_PROCID} \
            ./jacobi -niter 10
salloc: Granted job allocation 12864111
salloc: Waiting for resource configuration
salloc: Nodes jwb[0158,0163,0411-0412] are ready for job
Single GPU jacobi relaxation: 10 iterations on 16384 x 16384 mesh with norm check every 1 iterations
    0, 31.999022
Jacobi relaxation: 10 iterations on 16384 x 16384 mesh with norm check every 1 iterations
    0, 31.999023
Num GPUs: 16.
16384x16384: 1 GPU:   0.0331 s, 16 GPUs:   0.0850 s, speedup:     0.39, efficiency:     2.44 
Collecting data...
Generating '/tmp/nsys-report-fe53.qdstrm'
[1/1] [========================100%] jacobi.03H-sol__20251116-2050.7.nsys-repCollecting data...
Generating '/tmp/nsys-report-e3c0.qdstrm'
[1/1] [0%                          ] jacobi.03H-sol__20251116-2050.6.nsys-repCollecting data...
Generating '/tmp/nsys-report-ad51.qdstrm'
[1/1] [0%                          ] jacobi.03H-sol__20251116-2050.4.nsys-repCollecting data...
Generating '/tmp/nsys-report-dff2.qdstrm'

NCCL

In the original MPI version, each process (rank) talks to other ranks via the CPU: you call MPI_Sendrecv on the host, and the MPI library figures out how to move data between GPUs (possibly using CUDA-aware MPI, maybe staging through host, etc.). The mental model is “CPU processes exchanging messages,” with GPUs acting as accelerators that happen to store the data.

In the NCCL version, the focus shifts: communication is now a GPU operation scheduled on a CUDA stream, using ncclSend and ncclRecv. NCCL assumes your buffers are on the GPU and tries to push bits directly over NVLink / PCIe / InfiniBand in a topology-aware way. The mental model becomes “GPUs exchanging data directly,” with the CPU mostly orchestrating streams and communicators.

MPI version (CPU-centric):
--------------------------
  Rank 0 (CPU)  <---- MPI ---->  Rank 1 (CPU)
     |                              |
   GPU 0                          GPU 1

NCCL version (GPU-centric):
---------------------------
  Rank 0 (CPU)                Rank 1 (CPU)
     |                             |
  ncclComm                      ncclComm
     |                             |
   GPU 0  <==== NCCL (GPU↔GPU) ===> GPU 1

With pure MPI, initialization is simple: all ranks call MPI_Init, then query MPI_Comm_rank and MPI_Comm_size. This gives you a communicator (MPI_COMM_WORLD) for all ranks. In the NCCL variant, you still typically rely on MPI at startup, but only to distribute a NCCL unique ID. Rank 0 calls ncclGetUniqueId, then broadcasts that ID via MPI_Bcast. Each rank then calls ncclCommInitRank(size, id, rank) to create a NCCL communicator that mirrors MPI_COMM_WORLD but is specifically for GPU communication.

So instead of “just MPI,” you now have two layers: MPI for bootstrap, NCCL for data movement.

Step 1: MPI world
-----------------
All ranks:
  MPI_Init(...)
  MPI_Comm_rank(...)
  MPI_Comm_size(...)

Step 2: NCCL ID distribution
----------------------------
Rank 0:
  ncclGetUniqueId(&id)

All ranks:
  MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD)

Step 3: NCCL communicator
-------------------------
All ranks:
  ncclCommInitRank(&nccl_comm, size, id, rank)
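
Put together, the bootstrap is only a handful of lines. Below is a minimal, self-contained sketch of this pattern without the error-checking macros; the solution file further down does the same thing, but wraps every call in MPI_CALL/NCCL_CALL and selects the GPU from the node-local rank first.

#include <cstdio>
#include <cuda_runtime.h>
#include <mpi.h>
#include <nccl.h>

// Minimal MPI + NCCL bootstrap sketch (no error checking).
int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    ncclUniqueId id;
    if (rank == 0) ncclGetUniqueId(&id);                      // only rank 0 generates the ID
    MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);  // every rank gets a copy

    cudaSetDevice(0);  // in the real code: local_rank % num_devices, before creating the communicator
    ncclComm_t comm;
    ncclCommInitRank(&comm, size, id, rank);                  // one NCCL rank per MPI rank

    printf("rank %d of %d joined the NCCL communicator\n", rank, size);

    ncclCommDestroy(comm);
    MPI_Finalize();
    return 0;
}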

In the MPI Jacobi solver, each rank updates its local chunk of the 2D domain, then exchanges halo rows with its neighbors using MPI_Sendrecv. This call is CPU-side, blocking, and tag-based: you specify source/destination ranks and tags, and MPI handles message matching for you. One call can simultaneously send one row and receive another, which is convenient for a symmetric halo exchange.

NCCL does not have a Sendrecv and does not use tags. Instead, you issue separate ncclSend and ncclRecv calls and rely on call ordering plus peer ranks for matching. You wrap related operations in ncclGroupStart() / ncclGroupEnd() so NCCL can set up all transfers together and avoid deadlocks. Also, every NCCL operation takes a CUDA stream, so your halo exchange becomes part of the GPU execution graph rather than a pure host-blocking call.

Ranks arranged in a ring:
    0  <->  1  <->  2  <-> ... <->  size-1
   (top)       (middle ranks)          (bottom)

Per iteration halo exchange:

MPI version:
------------
Rank r:
  MPI_Sendrecv( send first interior row to top,
                recv bottom halo from bottom )
  MPI_Sendrecv( send last interior row to bottom,
                recv top halo from top )

NCCL version:
-------------
Rank r:
  ncclGroupStart();
    ncclSend(send_top_row,   nx, ncclFloat, top,    comm, stream);
    ncclRecv(recv_bottom_row,nx, ncclFloat, bottom, comm, stream);
    ncclSend(send_bottom_row,nx, ncclFloat, bottom, comm, stream);
    ncclRecv(recv_top_row,   nx, ncclFloat, top,    comm, stream);
  ncclGroupEnd();

With MPI, the typical pattern is: launch the kernel, synchronize so the boundary rows are ready, then call MPI_Sendrecv on the host. Even if your MPI is CUDA-aware and uses GPUDirect RDMA, the call semantics remain host-blocking: the CPU can't move on until MPI says the transfer is done (unless you explicitly use MPI_Isend/MPI_Irecv and manage the requests yourself).
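
For contrast, a non-blocking MPI variant of the same halo exchange would look roughly like the sketch below. This is not what the tutorial solution does (it sticks with MPI_Sendrecv); the sketch simply reuses the buffer names and neighbor ranks from the solver above to show where the requests and the final wait go.

// Sketch: non-blocking halo exchange with (CUDA-aware) MPI.
// a_new, nx, iy_start, iy_end, top, bottom, compute_done as in the solver above.
MPI_Request reqs[4];
CUDA_RT_CALL(cudaEventSynchronize(compute_done));  // boundary rows must be computed first
MPI_CALL(MPI_Irecv(a_new,                     nx, MPI_REAL_TYPE, top,    0, MPI_COMM_WORLD, &reqs[0]));
MPI_CALL(MPI_Irecv(a_new + iy_end * nx,       nx, MPI_REAL_TYPE, bottom, 0, MPI_COMM_WORLD, &reqs[1]));
MPI_CALL(MPI_Isend(a_new + iy_start * nx,     nx, MPI_REAL_TYPE, top,    0, MPI_COMM_WORLD, &reqs[2]));
MPI_CALL(MPI_Isend(a_new + (iy_end - 1) * nx, nx, MPI_REAL_TYPE, bottom, 0, MPI_COMM_WORLD, &reqs[3]));
// ... the CPU could overlap other work here ...
MPI_CALL(MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE));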

With NCCL, every ncclSend / ncclRecv is enqueued on a CUDA stream. The calls are non-blocking from the CPU’s perspective: they return once the operation is queued, not completed. The actual data movement is driven by the GPU and NIC as part of the stream’s work. You synchronize later with cudaStreamSynchronize or by sequencing kernels and comm ops carefully.

MPI (blocking)
--------------
CPU:
  [launch kernel] -> [wait for kernel] -> [MPI_Sendrecv (block)]
GPU:
  [compute Jacobi] ----- idle while MPI is active -----> [next kernel]

NCCL (stream-based)
-------------------
CPU:
  [launch kernel] -> [enqueue ncclSend/Recv] -> [launch more kernels] -> [sync later]
GPU (stream):
  [compute Jacobi] -> [NCCL halo transfer] -> [next compute]
Network/NIC:
  <---------- data in flight while CPU and GPU can overlap work ---------->
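
In code, that stream-based timeline is a sequence of asynchronous enqueues followed by a single synchronization point. A sketch, reusing the names from the pseudocode above (comm, stream, and the row pointers are assumed to exist):

// Everything below returns to the CPU as soon as it is enqueued on `stream`.
launch_jacobi_kernel(a_new, a, l2_norm_d, iy_start, iy_end, nx, false, stream);

ncclGroupStart();
ncclSend(a_new + iy_start * nx,     nx, ncclFloat, top,    comm, stream);
ncclRecv(a_new + iy_end * nx,       nx, ncclFloat, bottom, comm, stream);
ncclSend(a_new + (iy_end - 1) * nx, nx, ncclFloat, bottom, comm, stream);
ncclRecv(a_new,                     nx, ncclFloat, top,    comm, stream);
ncclGroupEnd();

// Block only when the halo data is actually needed (or chain streams with events instead).
cudaStreamSynchronize(stream);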

NCCL involves GPU kernels, so projects often separate host C++ (.cpp) and GPU kernels (kernels.cu) to work nicely with CUDA’s build model and keep the code organized. MPI examples are often pure CPU, so they don’t need that split.

/* Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *  * Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 *  * Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 *  * Neither the name of NVIDIA CORPORATION nor the names of its
 *    contributors may be used to endorse or promote products derived
 *    from this software without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
 * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
 * PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
 * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
 * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
 * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
 * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
 * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
 * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 */
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <iostream>
#include <sstream>

#include <mpi.h>
#define MPI_CALL(call)                                                                \
    {                                                                                 \
        int mpi_status = call;                                                        \
        if (0 != mpi_status) {                                                        \
            char mpi_error_string[MPI_MAX_ERROR_STRING];                              \
            int mpi_error_string_length = 0;                                          \
            MPI_Error_string(mpi_status, mpi_error_string, &mpi_error_string_length); \
            if (NULL != mpi_error_string)                                             \
                fprintf(stderr,                                                       \
                        "ERROR: MPI call \"%s\" in line %d of file %s failed "        \
                        "with %s "                                                    \
                        "(%d).\n",                                                    \
                        #call, __LINE__, __FILE__, mpi_error_string, mpi_status);     \
            else                                                                      \
                fprintf(stderr,                                                       \
                        "ERROR: MPI call \"%s\" in line %d of file %s failed "        \
                        "with %d.\n",                                                 \
                        #call, __LINE__, __FILE__, mpi_status);                       \
        }                                                                             \
    }

#include <cuda_runtime.h>

#ifdef USE_NVTX
#include <nvToolsExt.h>

const uint32_t colors[] = {0x0000ff00, 0x000000ff, 0x00ffff00, 0x00ff00ff,
                           0x0000ffff, 0x00ff0000, 0x00ffffff};
const int num_colors = sizeof(colors) / sizeof(uint32_t);

#define PUSH_RANGE(name, cid)                              \
    {                                                      \
        int color_id = cid;                                \
        color_id = color_id % num_colors;                  \
        nvtxEventAttributes_t eventAttrib = {0};           \
        eventAttrib.version = NVTX_VERSION;                \
        eventAttrib.size = NVTX_EVENT_ATTRIB_STRUCT_SIZE;  \
        eventAttrib.colorType = NVTX_COLOR_ARGB;           \
        eventAttrib.color = colors[color_id];              \
        eventAttrib.messageType = NVTX_MESSAGE_TYPE_ASCII; \
        eventAttrib.message.ascii = name;                  \
        nvtxRangePushEx(&eventAttrib);                     \
    }
#define POP_RANGE nvtxRangePop();
#else
#define PUSH_RANGE(name, cid)
#define POP_RANGE
#endif

#define CUDA_RT_CALL(call)                                                                  \
    {                                                                                       \
        cudaError_t cudaStatus = call;                                                      \
        if (cudaSuccess != cudaStatus)                                                      \
            fprintf(stderr,                                                                 \
                    "ERROR: CUDA RT call \"%s\" in line %d of file %s failed "              \
                    "with "                                                                 \
                    "%s (%d).\n",                                                           \
                    #call, __LINE__, __FILE__, cudaGetErrorString(cudaStatus), cudaStatus); \
    }

//TODO: include NCCL headers
#include <nccl.h>
#ifdef NCCL_VERSION
#define NCCL_VERSION_UB NCCL_VERSION(2,19,1)
#define NCCL_UB_SUPPORT NCCL_VERSION_CODE >= NCCL_VERSION_UB
#else
#define NCCL_UB_SUPPORT 0
#endif


#define NCCL_CALL(call)                                                                     \
    {                                                                                       \
        ncclResult_t  ncclStatus = call;                                                    \
        if (ncclSuccess != ncclStatus)                                                      \
            fprintf(stderr,                                                                 \
                    "ERROR: NCCL call \"%s\" in line %d of file %s failed "                 \
                    "with "                                                                 \
                    "%s (%d).\n",                                                           \
                    #call, __LINE__, __FILE__, ncclGetErrorString(ncclStatus), ncclStatus); \
    }

#ifdef USE_DOUBLE
typedef double real;
#define MPI_REAL_TYPE MPI_DOUBLE
#define NCCL_REAL_TYPE ncclDouble
#else
typedef float real;
#define MPI_REAL_TYPE MPI_FLOAT
#define NCCL_REAL_TYPE ncclFloat
#endif

constexpr real tol = 1.0e-8;

const real PI = 2.0 * std::asin(1.0);

void launch_initialize_boundaries(real* __restrict__ const a_new, real* __restrict__ const a,
                                  const real pi, const int offset, const int nx, const int my_ny,
                                  const int ny);

void launch_jacobi_kernel(real* __restrict__ const a_new, const real* __restrict__ const a,
                          real* __restrict__ const l2_norm, const int iy_start, const int iy_end,
                          const int nx, const bool calculate_norm, cudaStream_t stream);

double single_gpu(const int nx, const int ny, const int iter_max, real* const a_ref_h,
                  const int nccheck, const bool print);

template <typename T>
T get_argval(char** begin, char** end, const std::string& arg, const T default_val) {
    T argval = default_val;
    char** itr = std::find(begin, end, arg);
    if (itr != end && ++itr != end) {
        std::istringstream inbuf(*itr);
        inbuf >> argval;
    }
    return argval;
}

bool get_arg(char** begin, char** end, const std::string& arg) {
    char** itr = std::find(begin, end, arg);
    if (itr != end) {
        return true;
    }
    return false;
}

int main(int argc, char* argv[]) {
    //----------------------------------------------------------------------------
    //                      PROGRAM ENTRY AND SETUP REGION                       
    //----------------------------------------------------------------------------
    MPI_CALL(MPI_Init(&argc, &argv));
    int rank;
    MPI_CALL(MPI_Comm_rank(MPI_COMM_WORLD, &rank));
    int size;
    MPI_CALL(MPI_Comm_size(MPI_COMM_WORLD, &size));

    //TODO: Create a ncclUniqueId, have rank 0 initialize it by using the appropriate runtime call,
    //      and remember to broadcast it to all ranks using MPI
    //HINT: Best practice: wrap your nccl calls using the above NCCL_CALL macro to catch runtime errors early.
    //
    //--- NCCL unique id creation and distribution --
    //      Rank 0              Other Ranks
    //    +--------+           +--------+
    //    |  (0)   |           | (1..N) |
    //    +--------+           +--------+
    //        |                    |
    //    [ncclGetUniqueId]        |
    //        |     nccl_uid       |
    //        +---------+----------+
    //                  |
    //           [MPI_Bcast]
    //                  |
    //         All ranks now have the
    //         same NCCL unique ID!
    // --------------------------------------------
    ncclUniqueId nccl_uid;
    if (rank == 0)
        NCCL_CALL(ncclGetUniqueId(&nccl_uid));     // Only rank 0 generates a new NCCL UID
    MPI_CALL(MPI_Bcast(&nccl_uid, sizeof(ncclUniqueId), MPI_BYTE, 0, MPI_COMM_WORLD)); // Share with all ranks

    //--- Command line arguments ---------------------------------------------------//
    const int iter_max = get_argval<int>(argv, argv + argc, "-niter", 1000);
    const int nccheck = get_argval<int>(argv, argv + argc, "-nccheck", 1);
    const int nx = get_argval<int>(argv, argv + argc, "-nx", 16384);
    const int ny = get_argval<int>(argv, argv + argc, "-ny", 16384);
    const bool csv = get_arg(argv, argv + argc, "-csv");
    bool user_buffer_reg = get_arg(argv, argv + argc, "-user_buffer_reg");
#if NCCL_UB_SUPPORT == 0
    if (user_buffer_reg) {
        fprintf(stderr,"WARNING: Ignoring -user_buffer_reg, required NCCL APIs are provided by NCCL 2.19.1 or later.\n");
        user_buffer_reg = false;
    }
#endif //NCCL_UB_SUPPORT == 0

    //--- Get local_rank for NUMA and GPU locality --------------------------------//
    // Get the local rank of this process among processes on the same node.
    // This is typically used for assigning GPUs in a multi-GPU node setting.
    //
    // Illustration (for 4 total MPI ranks, 2 nodes, 2 ranks/node):
    //
    //          Node 0                    Node 1
    //   +------------------+    +------------------+
    //   | rank 0 (local 0) |    | rank 2 (local 0) |
    //   | rank 1 (local 1) |    | rank 3 (local 1) |
    //   +------------------+    +------------------+
    //
    //   => On Node 0: local_rank 0 for rank 0, local_rank 1 for rank 1,
    //      On Node 1: local_rank 0 for rank 2, local_rank 1 for rank 3
    //
    int local_rank = -1;
    {
        MPI_Comm local_comm;
        MPI_CALL(MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL,
                                     &local_comm));
        MPI_CALL(MPI_Comm_rank(local_comm, &local_rank)); // assign local_rank (unique per node)
        MPI_CALL(MPI_Comm_free(&local_comm));
    }

    //--- CUDA runtime GPU setup ---------------------------------------------------//
    int num_devices = 0;
    CUDA_RT_CALL(cudaGetDeviceCount(&num_devices));
    CUDA_RT_CALL(cudaSetDevice(local_rank%num_devices));
    CUDA_RT_CALL(cudaFree(0));

    //TODO: Create a communicator (ncclComm_t), initialize it (ncclCommInitRank)
    //--- NCCL Communicator: Relationship to Other Components ----------------------//
    //
    // The NCCL communicator (`ncclComm_t`) forms the backbone of GPU-to-GPU communication
    // in distributed programs. But it operates in concert with several other components:
    //
    //   +---------------------+       +-------------+       +-------------+
    //   |   MPI Process (CPU) |<----->| NCCL Comm   |<----->| CUDA Stream |
    //   +---------------------+       +-------------+       +-------------+
    //             |                                        /        |
    //       [MPI_Init, etc]                 [Used for NCCL        [CUDA kernels,
    //             |                          collectives]         memory ops]
    //             v                                        \        v
    //    +---------------------+                    +--------------------+
    //    |   ncclUniqueId      |<------------------>|   Device Buffers   |
    //    +---------------------+   (shared via MPI) +--------------------+
    //
    // Legend:
    // - MPI Process: Orchestrates rank assignment, startup, and distributes the `ncclUniqueId`.
    // - ncclUniqueId: An opaque connection token generated on one rank and broadcast to all
    //   via MPI, so every participant can join the same NCCL communicator.
    // - NCCL Comm: The communicator state that knows about all participants, devices, and
    //   orchestrates effective use of fabric topology.
    // - CUDA Stream: CUDA work queues where NCCL operations are enqueued along with compute.
    // - Device Buffers: The GPU-resident memory regions that are read/written by NCCL comms.
    //
    // Flow:
    // 1. MPI sets up the processes and shares the ncclUniqueId. All processes execute:
    //
    //    - Rank 0: ncclGetUniqueId(&nccl_uid)
    //    - All:    MPI_Bcast(&nccl_uid, ...)
    //    - Each:   ncclCommInitRank(&comm, N, nccl_uid, myrank)
    //
    // 2. Each rank configures its GPU (via CUDA), and instantiates a CUDA stream.
    // 3. Communication operations (e.g., ncclSend/Recv or collectives) are issued
    //    on that CUDA stream and are tracked/brokered by the NCCL communicator.
    // 4. NCCL operations access device memory directly using these communicator and
    //    stream associations, enabling highly parallel, overlapping compute and comm.
    //
    // This integration can be visualized:
    //
    //     MPI Process               NCCL Comm         CUDA Stream            GPU Memory
    //  +----------------+        +------------+     +-------------+      +------------------+
    //  |   Rank 0,1..N  |------->|  Rank info |---->| Stream(s)   |----->| send/recv buffers|
    //  | ("world" view) |  |     | ncclUniqueId|     +-------------+      +------------------+
    //  +----------------+  |     +------------+
    //                      |
    //  [ MPI_Bcast, etc. ] |
    //                      v
    //      +---------------------+
    //      |    ncclUniqueId     |
    //      +---------------------+
    //
    // All rank-specific communicators thus become the link between multi-process orchestration (MPI),
    // device context (CUDA), user buffers, and network hardware, providing scalable, high-performance
    // data movement on and between GPUs.
    ncclComm_t nccl_comm;
    NCCL_CALL(ncclCommInitRank(&nccl_comm, size, nccl_uid, rank));
    int nccl_version = 0;
    NCCL_CALL(ncclGetVersion(&nccl_version));
    if ( nccl_version < 2800 ) {
        fprintf(stderr,"ERROR NCCL 2.8 or newer is required.\n");
        NCCL_CALL(ncclCommDestroy(nccl_comm));
        MPI_CALL(MPI_Finalize());
        return 1;
    }

    //----------------------------------------------------------------------------
    //                        HOST BUFFERS AND SINGLE GPU REF                    
    //----------------------------------------------------------------------------
    // Illustration: Host buffers and global memory layout for single GPU reference
    //
    // Global grid (host) memory layout (page-locked via cudaMallocHost):
    //
    //     a_ref_h           a_h
    //      |                |
    //      v                v
    //   +-----------------------------------------------------------+
    //   |                    nx * ny entries                        |
    //   +-----------------------------------------------------------+
    //
    // Both a_ref_h and a_h hold the entire global problem grid (not partitioned by rank)
    //    - Laid out as a contiguous 1D array of size (ny rows) * (nx columns)
    //
    // 2D logical grid mapping (row-major):
    //        Columns (nx):   j = 0  1  ...      nx-1
    //     Rows (ny):
    //     i = 0            [  0        ...      nx-1   ]
    //     i = 1            [  nx       ...    2*nx-1   ]
    //      ...             [  ...                ...   ]
    //     i = ny-1         [ (ny-1)*nx ...   (ny*nx)-1 ]
    //
    // For a given (i,j):  index = i*nx + j
    //
    // Purpose:
    //   - a_ref_h:   Reference ("gold") solution for entire grid, used to check parallel results.
    //   - a_h:       Scratch buffer, same structure, for temporary or alternate computation.
    //
    // The serial reference run:
    //     single_gpu(nx, ny, iter_max, a_ref_h, ...)
    // operates on the *global* problem array in these buffers.
    real* a_ref_h;
    CUDA_RT_CALL(cudaMallocHost(&a_ref_h, nx * ny * sizeof(real)));
    real* a_h;
    CUDA_RT_CALL(cudaMallocHost(&a_h, nx * ny * sizeof(real)));
    double runtime_serial = single_gpu(nx, ny, iter_max, a_ref_h, nccheck, !csv && (0 == rank));

    //----------------------------------------------------------------------------
    //                        DOMAIN DECOMPOSITION REGION                        
    //----------------------------------------------------------------------------
    // The ny - 2 interior rows are distributed amongst `size` ranks so that each rank gets
    //   either (ny - 2) / size or (ny - 2) / size + 1 rows. This keeps the load balanced
    //   even when (ny - 2) % size != 0.
    //
    //                              Global grid memory layout
    // ┌───────────────────────────────────────────────────────────────────────────┐
    // │ [Boundary] [Interior: rows 1 ... ny-2]                [Boundary]          │
    // │     0          1   ...    ny-2                           ny-1             │
    // │     |          |           |                               |              │
    // │     v          v           v                               v              │
    // │ +---------------------------------------------------------------+         │
    // │ |               nx * ny entries (row-major)                     |         │
    // │ +---------------------------------------------------------------+         │
    // └───────────────────────────────────────────────────────────────────────────┘
    //
    // Each rank owns a consecutive block of the interior rows (see below).
    //
    //                              Device buffer (per rank)
    //
    //     Local buffer row:     0           1         ...    chunk_size      chunk_size+1
    //                           |           |                    |                |
    //                           v           v                    v                v
    //     Row meaning:       [halo]   [1st interior]  ...  [last interior]      [halo]
    //                      recv from      local              local            recv from
    //                     top neighbor    comp.              comp.          bottom neighbor
    //                     (or boundary)                                      (or boundary)
    //
    //     <---------------------------- nx elements per row ---------------------------->
    //
    int chunk_size;
    int chunk_size_low = (ny - 2) / size;
    int chunk_size_high = chunk_size_low + 1;
    // Calculate how many ranks get the smaller chunk size:
    //   num_ranks_low * chunk_size_low + (size - num_ranks_low) * (chunk_size_low + 1) = ny - 2
    //   This ensures exactly (ny - 2) rows are assigned among all ranks.
    int num_ranks_low = size * chunk_size_low + size -
                        (ny - 2);  // Number of ranks with chunk_size = chunk_size_low
    if (rank < num_ranks_low)
        chunk_size = chunk_size_low;
    else
        chunk_size = chunk_size_high;
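    // Worked example (16384 x 16384 mesh, 16 ranks):
    //   interior rows = 16382, chunk_size_low = 1023, chunk_size_high = 1024,
    //   num_ranks_low = 16 * 1023 + 16 - 16382 = 2,
    //   so ranks 0-1 compute 1023 rows each and ranks 2-15 compute 1024 rows each
    //   (2 * 1023 + 14 * 1024 = 16382, exactly the interior row count).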

    //----------------------------------------------------------------------------
    //                             DEVICE ALLOCATION                             
    //----------------------------------------------------------------------------
    // Device allocation: Mapping to per-rank buffer memory layout and how the math works.
    //
    // Each rank allocates its own section of device memory for `a` and `a_new`.
    // Each device buffer holds its chunk of the global grid's interior rows, plus two HALO rows for neighbor exchange.
    //   - Both arrays have (chunk_size + 2) rows, each of nx reals (columns).
    //   - The math:
    //       total_rows_per_rank = chunk_size
    //       halo_rows = 2     // (one at top, one at bottom)
    //       buffer_rows = chunk_size + halo_rows
    //       buffer_size_in_bytes = nx * (chunk_size + 2) * sizeof(real)
    //   - This ensures that per-rank local memory can hold computation rows plus the halos needed for neighbor comms.
    //
    //     local device buffer for one rank:
    //
    //     Row in buffer:     | 0     | 1          | ... | chunk_size     | chunk_size+1    |
    //     Meaning:           | [halo]| [first]    | ... | [last]         | [halo]          |
    //                        | (top) | interior   |     | interior       | (bottom)        |
    //
    //     Global grid mapping:
    //        ^           chunk_size rows           ^
    //     [boundary] [interior rows for rank] [boundary]
    //
    //     Memory view for one buffer (nx elements per row):
    //
    //     +------------------+-------------------+---------+-------------------+------------------+
    //     | a[0:nx-1]        | a[nx:2nx-1]       |  ...    | a[N: N+nx-1]      | a[M: M+nx-1]     |
    //     +------------------+-------------------+---------+-------------------+------------------+
    //        top halo    1st interior           ...      last interior       bottom halo
    //
    //     Where:
    //         - N = chunk_size * nx
    //         - M = (chunk_size+1) * nx
    //
    // Math Recap:
    // -------
    //   If nx = 1024, chunk_size = 2000:
    //   - Total buffer elements per array = nx * (chunk_size + 2) = 1024 * 2002 = 2,050,048
    //   - Bytes per buffer per rank = 2,050,048 * sizeof(real)
    //   - Both a[] and a_new[] are identically sized.
    //
    //   You access buffer rows as:
    //     a[ row * nx + col ]  for  0 <= row < chunk_size+2, 0 <= col < nx
    //   - Row 0:      top halo   <-- receives from upper neighbor or BC
    //   - Rows 1 .. chunk_size: owned interior rows
    //   - Row chunk_size+1: bottom halo <-- receives from lower neighbor or BC
    //
    real* a;
    real* a_new;
#if NCCL_UB_SUPPORT
    void* a_reg_handle;
    void* a_new_reg_handle;
    if (user_buffer_reg) {
    //TODO: Allocate the memory with ncclMemAlloc and register it with the communicator
        NCCL_CALL(ncclMemAlloc( (void**) &a    , nx * (chunk_size + 2) * sizeof(real)));
        NCCL_CALL(ncclMemAlloc( (void**) &a_new, nx * (chunk_size + 2) * sizeof(real)));
        NCCL_CALL(ncclCommRegister(nccl_comm, a    , nx * (chunk_size + 2) * sizeof(real), &a_reg_handle));
        NCCL_CALL(ncclCommRegister(nccl_comm, a_new, nx * (chunk_size + 2) * sizeof(real), &a_new_reg_handle));
        if ( nccl_version < 22304 ) {
            fprintf(stderr,"WARNING: -user_buffer_reg available, but Jacobi communication pattern needs NCCL 2.23.4 or later.\n");
        }
    }
    else
#endif //NCCL_UB_SUPPORT
    {
    CUDA_RT_CALL(cudaMalloc(&a, nx * (chunk_size + 2) * sizeof(real)));
    CUDA_RT_CALL(cudaMalloc(&a_new, nx * (chunk_size + 2) * sizeof(real)));
    }
    CUDA_RT_CALL(cudaMemset(a, 0, nx * (chunk_size + 2) * sizeof(real)));
    CUDA_RT_CALL(cudaMemset(a_new, 0, nx * (chunk_size + 2) * sizeof(real)));

    // Calculate local domain boundaries
    int iy_start_global;  // My start index in the global array
    if (rank < num_ranks_low) {
        iy_start_global = rank * chunk_size_low + 1;
    } else {
        iy_start_global =
            num_ranks_low * chunk_size_low + (rank - num_ranks_low) * chunk_size_high + 1;
    }
    int iy_end_global = iy_start_global + chunk_size - 1;  // My last index in the global array

    int iy_start = 1;
    int iy_end = iy_start + chunk_size;

    // Set Dirichlet boundary conditions on the left and right border
    launch_initialize_boundaries(a, a_new, PI, iy_start_global - 1, nx, (chunk_size + 2), ny);
    CUDA_RT_CALL(cudaDeviceSynchronize());

    //----------------------------------------------------------------------------
    //                             STREAMS & EVENTS REGION                       
    //----------------------------------------------------------------------------
    //   Illustration: How CUDA Streams & Events interact with the program
    // -----------------------------------------------------------------------
    //
    //  +-----------------+      +----------------+       +------------------------+
    //  |     Host CPU    | <--> | CUDA Streams   | <---->| Device (GPU) Kernels/  |
    //  |   (main thread) |      | (compute/push) |       |  NCCL, memory ops      |
    //  +-----------------+      +----------------+       +------------------------+
    //          |                          |                        |
    //          |    cudaStream_t (handles)|                        |
    //          |------------------------->|                        |
    //          |      cudaLaunchKernel/cudaMemcpyAsync             |
    //          |------------------------->|                        |
    //          |                          |----> [Compute]         |
    //          |                          |----> [Comm]            |
    //          |    cudaEvent_t (handles) |                        |
    //          |<----- Events ----------->|                        |
    //          |   (sync, order, trigger) |                        |
    //
    //  - compute_stream:  where main GPU compute (Jacobi, etc.) is enqueued
    //  - push_stream:     for high-priority communication/completion tasks
    //  - Events (push_prep_done, push_done, reset_l2norm_done): for
    //    synchronization and precise ordering between streams and with host
    // -----------------------------------------------------------------------

    int leastPriority = 0;
    int greatestPriority = leastPriority;
    CUDA_RT_CALL(cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority));
    cudaStream_t compute_stream;
    CUDA_RT_CALL(cudaStreamCreateWithPriority(&compute_stream, cudaStreamDefault, leastPriority));
    cudaStream_t push_stream;
    CUDA_RT_CALL(
        cudaStreamCreateWithPriority(&push_stream, cudaStreamDefault, greatestPriority));

    cudaEvent_t push_prep_done;
    CUDA_RT_CALL(cudaEventCreateWithFlags(&push_prep_done, cudaEventDisableTiming));
    cudaEvent_t push_done;
    CUDA_RT_CALL(cudaEventCreateWithFlags(&push_done, cudaEventDisableTiming));
    cudaEvent_t reset_l2norm_done;
    CUDA_RT_CALL(cudaEventCreateWithFlags(&reset_l2norm_done, cudaEventDisableTiming));

    real* l2_norm_d;
    CUDA_RT_CALL(cudaMalloc(&l2_norm_d, sizeof(real)));
    real* l2_norm_h;
    CUDA_RT_CALL(cudaMallocHost(&l2_norm_h, sizeof(real)));

    //----------------------------------------------------------------------------
    //                              NCCL WARMUP REGION                           
    //----------------------------------------------------------------------------
    //TODO: Rename range
    //  Illustration: Why Warm Up NCCL? (Key when using NCCL collectives!)
    //
    //     +------------+       +-------------+        +-------------+
    //     |  1st call  | <-->  | driver/init | <----> | comm setup  |
    //     +------------+       +-------------+        +-------------+
    //         |                    |                       |
    //         |    (may trigger heavy startup latency)     |
    //         v                                            v
    //     +-----------------------------------------------------------+
    //     |  GPU/NIC queues/cache tuning, runtime load, device setup  |
    //     +-----------------------------------------------------------+
    //
    //   * The FIRST communication often triggers JIT initialization and
    //     device/NIC warmup (while later calls are much faster).
    //   * So: always warm up before performance measurement for fair results!
    //------------------------------------------------------------------------

    PUSH_RANGE("NCCL_Warmup", 5)
    for (int i = 0; i < 10; ++i) {
        const int top = rank > 0 ? rank - 1 : (size - 1);
        const int bottom = (rank + 1) % size;
        //TODO: Replace the MPI_Sendrecv calls with ncclRecv and ncclSend calls using the nccl communicator 
        //      on the compute_stream.
        //      Remember that a group of ncclRecv and ncclSend calls should be enclosed in ncclGroupStart() and ncclGroupEnd().
        //      Also, remember to synchronize the compute_stream at the end.
        NCCL_CALL(ncclGroupStart());
        NCCL_CALL(ncclRecv(a_new,                     nx, NCCL_REAL_TYPE, top,    nccl_comm, compute_stream));
        NCCL_CALL(ncclSend(a_new + (iy_end - 1) * nx, nx, NCCL_REAL_TYPE, bottom, nccl_comm, compute_stream));
        NCCL_CALL(ncclRecv(a_new + (iy_end * nx),     nx, NCCL_REAL_TYPE, bottom, nccl_comm, compute_stream));
        NCCL_CALL(ncclSend(a_new + iy_start * nx,     nx, NCCL_REAL_TYPE, top,    nccl_comm, compute_stream));
        NCCL_CALL(ncclGroupEnd());
        CUDA_RT_CALL(cudaStreamSynchronize(compute_stream));
        std::swap(a_new, a);
    }
    POP_RANGE

    CUDA_RT_CALL(cudaDeviceSynchronize());

    //----------------------------------------------------------------------------
    //                             MAIN OUTPUT HEADER REGION                      
    //----------------------------------------------------------------------------
    if (!csv && 0 == rank) {
        printf(
            "Jacobi relaxation: %d iterations on %d x %d mesh with norm check "
            "every %d iterations\n",
            iter_max, ny, nx, nccheck);
    }

    //----------------------------------------------------------------------------
    //                             JACOBI SOLVER REGION                          
    //----------------------------------------------------------------------------
    int iter = 0;
    bool calculate_norm;
    real l2_norm = 1.0;

    MPI_CALL(MPI_Barrier(MPI_COMM_WORLD));
    double start = MPI_Wtime();
    PUSH_RANGE("Jacobi solve", 0)
    while (l2_norm > tol && iter < iter_max) {
        CUDA_RT_CALL(cudaMemsetAsync(l2_norm_d, 0, sizeof(real), compute_stream));
        CUDA_RT_CALL(cudaEventRecord(reset_l2norm_done, compute_stream));

        CUDA_RT_CALL(cudaStreamWaitEvent(push_stream, reset_l2norm_done, 0));
        calculate_norm = (iter % nccheck) == 0 || (!csv && (iter % 100) == 0);

        launch_jacobi_kernel(a_new, a, l2_norm_d, (iy_start + 1), (iy_end - 1), nx, calculate_norm,
                             compute_stream);

        launch_jacobi_kernel(a_new, a, l2_norm_d, iy_start, (iy_start + 1), nx, calculate_norm,
                             push_stream);

        launch_jacobi_kernel(a_new, a, l2_norm_d, (iy_end - 1), iy_end, nx, calculate_norm,
                             push_stream);
        CUDA_RT_CALL(cudaEventRecord(push_prep_done, push_stream));

        if (calculate_norm) {
            CUDA_RT_CALL(cudaStreamWaitEvent(compute_stream, push_prep_done, 0));
            CUDA_RT_CALL(cudaMemcpyAsync(l2_norm_h, l2_norm_d, sizeof(real), cudaMemcpyDeviceToHost,
                                         compute_stream));
        }

        const int top = rank > 0 ? rank - 1 : (size - 1);
        const int bottom = (rank + 1) % size;

        // Apply periodic boundary conditions
        //TODO: Modify the label for the RANGE, and replace the MPI_Sendrecv calls with ncclSend and ncclRecv calls
        //      using the nccl communicator and push_stream.
        //      Remember to use ncclGroupStart() and ncclGroupEnd()
        PUSH_RANGE("NCCL_LAUNCH", 5)
        NCCL_CALL(ncclGroupStart());
        NCCL_CALL(ncclRecv(a_new,                     nx, NCCL_REAL_TYPE, top,    nccl_comm, push_stream));
        NCCL_CALL(ncclSend(a_new + (iy_end - 1) * nx, nx, NCCL_REAL_TYPE, bottom, nccl_comm, push_stream));
        NCCL_CALL(ncclRecv(a_new + (iy_end * nx),     nx, NCCL_REAL_TYPE, bottom, nccl_comm, push_stream));
        NCCL_CALL(ncclSend(a_new + iy_start * nx,     nx, NCCL_REAL_TYPE, top,    nccl_comm, push_stream));
        NCCL_CALL(ncclGroupEnd());
        CUDA_RT_CALL(cudaEventRecord(push_done, push_stream));
        POP_RANGE

        if (calculate_norm) {
            CUDA_RT_CALL(cudaStreamSynchronize(compute_stream));
            MPI_CALL(MPI_Allreduce(l2_norm_h, &l2_norm, 1, MPI_REAL_TYPE, MPI_SUM, MPI_COMM_WORLD));
            l2_norm = std::sqrt(l2_norm);

            if (!csv && 0 == rank && (iter % 100) == 0) {
                printf("%5d, %0.6f\n", iter, l2_norm);
            }
        }
        CUDA_RT_CALL(cudaStreamWaitEvent(compute_stream, push_done, 0));

        std::swap(a_new, a);
        iter++;
    }
    CUDA_RT_CALL(cudaDeviceSynchronize());
    double stop = MPI_Wtime();
    POP_RANGE

    //----------------------------------------------------------------------------
    //                            VALIDATION REGION                              
    //----------------------------------------------------------------------------
    CUDA_RT_CALL(cudaMemcpy(a_h + iy_start_global * nx, a + nx,
                            std::min((ny - iy_start_global) * nx, chunk_size * nx) * sizeof(real),
                            cudaMemcpyDeviceToHost));

    int result_correct = 1;
    for (int iy = iy_start_global; result_correct && (iy < iy_end_global); ++iy) {
        for (int ix = 1; result_correct && (ix < (nx - 1)); ++ix) {
            if (std::fabs(a_ref_h[iy * nx + ix] - a_h[iy * nx + ix]) > tol) {
                fprintf(stderr,
                        "ERROR on rank %d: a[%d * %d + %d] = %f does not match %f "
                        "(reference)\n",
                        rank, iy, nx, ix, a_h[iy * nx + ix], a_ref_h[iy * nx + ix]);
                result_correct = 0;
            }
        }
    }

    int global_result_correct = 1;
    MPI_CALL(MPI_Allreduce(&result_correct, &global_result_correct, 1, MPI_INT, MPI_MIN,
                           MPI_COMM_WORLD));
    result_correct = global_result_correct;

    //----------------------------------------------------------------------------
    //                              FINAL OUTPUT REGION                          
    //----------------------------------------------------------------------------
    if (rank == 0 && result_correct) {
        if (csv) {
            //TODO: Don't forget to change your output label from mpi_overlap to nccl_overlap
            printf("nccl_overlap, %d, %d, %d, %d, %d, 1, %f, %f\n", nx, ny, iter_max, nccheck, size,
                    (stop - start), runtime_serial);
        } else {
            printf("Num GPUs: %d.\n", size);
            printf(
                "%dx%d: 1 GPU: %8.4f s, %d GPUs: %8.4f s, speedup: %8.2f, "
                "efficiency: %8.2f \n",
                ny, nx, runtime_serial, size, (stop - start), runtime_serial / (stop - start),
                runtime_serial / (size * (stop - start)) * 100);
        }
    }
    //----------------------------------------------------------------------------
    //                                  CLEANUP REGION                           
    //----------------------------------------------------------------------------
    CUDA_RT_CALL(cudaEventDestroy(reset_l2norm_done));
    CUDA_RT_CALL(cudaEventDestroy(push_done));
    CUDA_RT_CALL(cudaEventDestroy(push_prep_done));
    CUDA_RT_CALL(cudaStreamDestroy(push_stream));
    CUDA_RT_CALL(cudaStreamDestroy(compute_stream));

    CUDA_RT_CALL(cudaFreeHost(l2_norm_h));
    CUDA_RT_CALL(cudaFree(l2_norm_d));
#if NCCL_UB_SUPPORT
    if (user_buffer_reg) {
    //TODO: Deregister and Free the Buffer
        NCCL_CALL(ncclCommDeregister(nccl_comm, a_new_reg_handle));
        NCCL_CALL(ncclCommDeregister(nccl_comm, a_reg_handle));
        NCCL_CALL(ncclMemFree(a_new));
        NCCL_CALL(ncclMemFree(a));
    }
    else
#endif //NCCL_UB_SUPPORT
    {
        CUDA_RT_CALL(cudaFree(a_new));
        CUDA_RT_CALL(cudaFree(a));
    }
    CUDA_RT_CALL(cudaFreeHost(a_h));
    CUDA_RT_CALL(cudaFreeHost(a_ref_h));

    //TODO: Destroy the nccl communicator
    NCCL_CALL(ncclCommDestroy(nccl_comm));

    MPI_CALL(MPI_Finalize());
    return (result_correct == 1) ? 0 : 1;
}

// The single_gpu interface is identical to the MPI implementation
double single_gpu(const int nx, const int ny, const int iter_max, real* const a_ref_h,
                  const int nccheck, const bool print);

// jacobi_kernels.cu
//
// #############################################################################
// #                          KERNEL REGION                                    #
// #############################################################################

[wood2@jwlogin24 NCCL]$ make
nvcc \
    -DHAVE_CUB \
    -lineinfo \
    -gencode arch=compute_80,code=sm_80 \
    -gencode arch=compute_80,code=compute_80 \
    -std=c++14 \
    jacobi_kernels.cu -c
mpicxx -DUSE_NVTX \
    -I/p/software/juwelsbooster/stages/2025/software/CUDA/12/include \
    -I/usr/include \
    -std=c++14 \
    jacobi.cpp jacobi_kernels.o \
    -L/p/software/juwelsbooster/stages/2025/software/CUDA/12/lib64 \
    -lcudart -lnvToolsExt -lnccl \
    -o jacobi
[wood2@jwlogin24 NCCL]$ make run
salloc --reservation sc25-multigpu-tut --partition booster --gres=gpu:4 --disable-dcgm --time 0:10:00 srun --cpu-bind=sockets --unbuffered -n 4 ./jacobi
salloc: Granted job allocation 12864150
salloc: Waiting for resource configuration
salloc: Nodes jwb0158 are ready for job
Single GPU jacobi relaxation: 1000 iterations on 16384 x 16384 mesh with norm check every 1 iterations
    0, 31.999022
  100, 0.897983
  200, 0.535684
  300, 0.395651
  400, 0.319039
  500, 0.269961
  600, 0.235509
  700, 0.209829
  800, 0.189854
  900, 0.173818
Jacobi relaxation: 1000 iterations on 16384 x 16384 mesh with norm check every 1 iterations
    0, 31.999020
  100, 0.897983
  200, 0.535685
  300, 0.395651
  400, 0.319039
  500, 0.269961
  600, 0.235510
  700, 0.209829
  800, 0.189854
  900, 0.173818
Num GPUs: 4.
16384x16384: 1 GPU:   2.3275 s, 4 GPUs:   0.6014 s, speedup:     3.87, efficiency:    96.75 
salloc: Relinquishing job allocation 12864150
salloc: Job allocation 12864150 has been revoked.
[wood2@jwlogin24 NCCL]$ salloc --reservation sc25-multigpu-tut \
       --partition booster \
       --gres=gpu:4 \
       --nodes=4 \
       --disable-dcgm \
       --time 0:10:00 \
srun --cpu-bind=sockets --unbuffered -n 16 ./jacobi
salloc: Pending job allocation 12864152
salloc: job 12864152 queued and waiting for resources
salloc: job 12864152 has been allocated resources
salloc: Granted job allocation 12864152
salloc: Waiting for resource configuration
salloc: Nodes jwb[0066,0091,0103,0127] are ready for job
Single GPU jacobi relaxation: 1000 iterations on 16384 x 16384 mesh with norm check every 1 iterations
    0, 31.999022
  100, 0.897983
  200, 0.535684
  300, 0.395651
  400, 0.319039
  500, 0.269961
  600, 0.235509
  700, 0.209829
  800, 0.189854
  900, 0.173818
Jacobi relaxation: 1000 iterations on 16384 x 16384 mesh with norm check every 1 iterations
    0, 31.999023
  100, 0.897983
  200, 0.535685
  300, 0.395651
  400, 0.319039
  500, 0.269961
  600, 0.235510
  700, 0.209829
  800, 0.189854
  900, 0.173818
Num GPUs: 16.
16384x16384: 1 GPU:   2.3212 s, 16 GPUs:   0.1885 s, speedup:    12.31, efficiency:    76.96 
salloc: Relinquishing job allocation 12864152

NVSHMEM

NVSHMEM and NCCL are both designed to move data efficiently between GPUs, but they target very different communication styles. NCCL is a collective communication library tuned for deep learning workloads, where you repeatedly perform big operations like all-reduce on dense tensors. NVSHMEM, by contrast, is a one-sided PGAS (Partitioned Global Address Space) library that lets GPU threads directly read and write memory on remote GPUs using fine-grained put, get, and atomic operations. Where NCCL assumes relatively regular, bulk communication, NVSHMEM is built for irregular, dynamic, and fine-grained patterns driven directly from kernels.

In terms of communication model, NCCL gives you a set of highly optimized collective operations (all-reduce, broadcast, all-gather, reduce-scatter). You call a collective from the CPU, NCCL arranges the GPUs in rings or trees, and then shuffles data around according to that fixed pattern. NVSHMEM instead gives you a global symmetric memory space across GPUs; each GPU allocates a symmetric heap, and any GPU can directly write into or read from another GPU’s symmetric region. Conceptually, NCCL is “everyone participates in the same collective,” while NVSHMEM is “anyone can read/write anyone, whenever needed.”

+------------------+     +------------------+     +------------------+
|  GPU 0 Memory    |     |  GPU 1 Memory    |     |  GPU 2 Memory    |
|  [Symmetric Heap]|     |  [Symmetric Heap]|     |  [Symmetric Heap]|
+------------------+     +------------------+     +------------------+
       |                          |                         |
       +------- Global NVSHMEM Virtual Address Space ------+

NCCL (collective pattern):

GPU0 ──► GPU1 ──► GPU2 ──► GPU3
  ▲                         │
  └─────────────────────────┘
    (ring all-reduce, structured pattern)
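
To make the host-initiated model concrete, here is a deliberately minimal sketch, not taken from the tutorial code: a single all-reduce enqueued from the CPU. It assumes a communicator comm, a stream stream, and device buffers sendbuf/recvbuf holding count floats already exist, and it omits the error-checking macros used elsewhere in this post.

// Hypothetical helper, not from the tutorial sources: sum-reduce `count` floats
// across all ranks of `comm` with one host-initiated NCCL collective.
#include <cuda_runtime.h>
#include <nccl.h>

void allreduce_sum(const float* sendbuf, float* recvbuf, size_t count,
                   ncclComm_t comm, cudaStream_t stream) {
    // The host only enqueues the operation; NCCL picks a ring or tree schedule
    // internally and the GPUs execute it asynchronously on `stream`.
    ncclAllReduce(sendbuf, recvbuf, count, ncclFloat, ncclSum, comm, stream);
    // The reduced values are valid on every rank once the stream has drained.
    cudaStreamSynchronize(stream);
}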


NVSHMEM (PGAS one-sided pattern):

GPU0 ─────────────► GPU3
GPU1 ───► GPU2
GPU2 ───────────────────► GPU0
GPU3 ─────► GPU1

(any GPU can PUT/GET to any other GPU)

Another important contrast is who initiates communication and from where. With NCCL, the CPU triggers operations such as ncclAllReduce(...); the GPUs then run the collective, but you cannot call NCCL primitives from inside a CUDA kernel. NVSHMEM is explicitly designed for GPU-initiated communication: CUDA kernels can invoke nvshmem_put, nvshmem_get, and atomics directly. That means GPU threads can decide on-the-fly which peers to talk to, overlap computation and communication within the same kernel, and avoid frequent CPU synchronization.

NCCL control path:

  CPU calls ncclAllReduce(...)
       │
       ▼
     GPUs execute collective
       │
       ▼
    Result ready on GPUs


NVSHMEM control path:

GPU kernel thread
       │
       ├─ compute
       │
       ├─ nvshmem_put / nvshmem_get (direct GPU→GPU)
       │
       └─ compute more
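
By contrast, here is an equally minimal sketch of device-initiated communication, again not taken from the tutorial code: a kernel pushes nx floats straight into a peer's symmetric buffer. The names dest, src, nx, and peer are illustrative, and both buffers are assumed to come from nvshmem_malloc with the same size on every PE.

// Hypothetical kernel, not from the tutorial sources: a block-scoped one-sided
// put issued from device code, with no CPU in the data path.
#include <nvshmem.h>
#include <nvshmemx.h>

__global__ void push_row(float* dest, const float* src, int nx, int peer) {
    // All threads of the block cooperate on one contiguous put of nx floats
    // into PE `peer`'s copy of the symmetric buffer `dest`.
    nvshmemx_float_put_block(dest, src, nx, peer);
}

// Launched like any other kernel, e.g. one block on a stream:
//   push_row<<<1, 256, 0, stream>>>(dest, src, nx, peer);
// A completion/synchronization step (e.g. nvshmemx_barrier_all_on_stream on the
// host) is still needed before the target PE reads the data.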

Because of this, NCCL is fantastic for large, regular data movements—think gradient all-reduces across many GPUs in distributed training. It optimizes topology (NVLink, NVSwitch, InfiniBand), picks efficient ring or tree algorithms, and squeezes out maximum bandwidth for big tensors. NVSHMEM shines when communication is irregular or fine-grained, such as graph algorithms, stencil-based PDE solvers with halo exchanges, adaptive mesh refinement, or sparse updates, where each GPU might exchange small pieces of data with a changing set of neighbors from inside a kernel.

/* 
 * SPDX-FileCopyrightText: Copyright (c) 2017,2021 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 * SPDX-License-Identifier: MIT
 * 
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the "Software"), to deal
 * in the Software without restriction, including without limitation the rights
 * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 * copies of the Software, and to permit persons to whom the Software is
 * furnished to do so, subject to the following conditions:
 * 
 * The above copyright notice and this permission notice shall be included in all
 * copies or substantial portions of the Software.
 * 
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 * SOFTWARE.
 */
// ==================================================
//            Header files & macro definitions
// ==================================================
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdio>
#include <iostream>
#include <sstream>

#include <mpi.h>
#define MPI_CALL(call)                                                                \
    {                                                                                 \
        int mpi_status = call;                                                        \
        if (0 != mpi_status) {                                                        \
            char mpi_error_string[MPI_MAX_ERROR_STRING];                              \
            int mpi_error_string_length = 0;                                          \
            MPI_Error_string(mpi_status, mpi_error_string, &mpi_error_string_length); \
            if (NULL != mpi_error_string)                                             \
                fprintf(stderr,                                                       \
                        "ERROR: MPI call \"%s\" in line %d of file %s failed "        \
                        "with %s "                                                    \
                        "(%d).\n",                                                    \
                        #call, __LINE__, __FILE__, mpi_error_string, mpi_status);     \
            else                                                                      \
                fprintf(stderr,                                                       \
                        "ERROR: MPI call \"%s\" in line %d of file %s failed "        \
                        "with %d.\n",                                                 \
                        #call, __LINE__, __FILE__, mpi_status);                       \
        }                                                                             \
    }

#include <cuda_runtime.h>

//TODO: Include NVSHMEM headers
#include <nvshmem.h>
#include <nvshmemx.h>

#ifdef HAVE_CUB
#include <cub/block/block_reduce.cuh>
#endif  // HAVE_CUB

#ifdef USE_NVTX
#include <nvToolsExt.h>

// ---- NVTX color and macro configs ----
const uint32_t colors[] = {0x0000ff00, 0x000000ff, 0x00ffff00, 0x00ff00ff,
                           0x0000ffff, 0x00ff0000, 0x00ffffff};
const int num_colors = sizeof(colors) / sizeof(uint32_t);

#define PUSH_RANGE(name, cid)                              \
    {                                                      \
        int color_id = cid;                                \
        color_id = color_id % num_colors;                  \
        nvtxEventAttributes_t eventAttrib = {0};           \
        eventAttrib.version = NVTX_VERSION;                \
        eventAttrib.size = NVTX_EVENT_ATTRIB_STRUCT_SIZE;  \
        eventAttrib.colorType = NVTX_COLOR_ARGB;           \
        eventAttrib.color = colors[color_id];              \
        eventAttrib.messageType = NVTX_MESSAGE_TYPE_ASCII; \
        eventAttrib.message.ascii = name;                  \
        nvtxRangePushEx(&eventAttrib);                     \
    }
#define POP_RANGE nvtxRangePop();
#else
#define PUSH_RANGE(name, cid)
#define POP_RANGE
#endif

#define CUDA_RT_CALL(call)                                                                  \
    {                                                                                       \
        cudaError_t cudaStatus = call;                                                      \
        if (cudaSuccess != cudaStatus)                                                      \
            fprintf(stderr,                                                                 \
                    "ERROR: CUDA RT call \"%s\" in line %d of file %s failed "              \
                    "with "                                                                 \
                    "%s (%d).\n",                                                           \
                    #call, __LINE__, __FILE__, cudaGetErrorString(cudaStatus), cudaStatus); \
    }

// ==================================================
//                Typedefs & constants
// ==================================================
#ifdef USE_DOUBLE
typedef double real;
#define MPI_REAL_TYPE MPI_DOUBLE
#else
typedef float real;
#define MPI_REAL_TYPE MPI_FLOAT
#endif

constexpr real tol = 1.0e-8;

const real PI = 2.0 * std::asin(1.0);

// ==================================================
//               GPU Kernel Definitions
// ==================================================
__global__ void initialize_boundaries(real* __restrict__ const a_new, real* __restrict__ const a,
                                      const real pi, const int offset, const int nx,
                                      const int my_ny, const int ny) {
    for (int iy = blockIdx.x * blockDim.x + threadIdx.x; iy < my_ny; iy += blockDim.x * gridDim.x) {
        const real y0 = sin(2.0 * pi * (offset + iy) / (ny - 1));
        a[iy * nx + 0] = y0;
        a[iy * nx + (nx - 1)] = y0;
        a_new[iy * nx + 0] = y0;
        a_new[iy * nx + (nx - 1)] = y0;
    }
}

template <int BLOCK_DIM_X, int BLOCK_DIM_Y>
__global__ void jacobi_kernel(real* __restrict__ const a_new, const real* __restrict__ const a,
                              real* __restrict__ const l2_norm, const int iy_start,
                              const int iy_end, const int nx, const bool calculate_norm) {
#ifdef HAVE_CUB
    typedef cub::BlockReduce<real, BLOCK_DIM_X, cub::BLOCK_REDUCE_WARP_REDUCTIONS, BLOCK_DIM_Y>
        BlockReduce;
    __shared__ typename BlockReduce::TempStorage temp_storage;
#endif  // HAVE_CUB
    int iy = blockIdx.y * blockDim.y + threadIdx.y + iy_start;
    int ix = blockIdx.x * blockDim.x + threadIdx.x + 1;
    real local_l2_norm = 0.0;

    if (iy < iy_end && ix < (nx - 1)) {
        const real new_val = 0.25 * (a[iy * nx + ix + 1] + a[iy * nx + ix - 1] +
                                     a[(iy + 1) * nx + ix] + a[(iy - 1) * nx + ix]);
        a_new[iy * nx + ix] = new_val;
        if (calculate_norm) {
            real residue = new_val - a[iy * nx + ix];
            local_l2_norm += residue * residue;
        }
    }
    if (calculate_norm) {
#ifdef HAVE_CUB
        real block_l2_norm = BlockReduce(temp_storage).Sum(local_l2_norm);
        if (0 == threadIdx.y && 0 == threadIdx.x) atomicAdd(l2_norm, block_l2_norm);
#else
        atomicAdd(l2_norm, local_l2_norm);
#endif  // HAVE_CUB
    }
}

// ==================================================
//         Kernel Launch Helper Function
// ==================================================
void launch_jacobi_kernel(real* __restrict__ const a_new, const real* __restrict__ const a,
                          real* __restrict__ const l2_norm, const int iy_start, const int iy_end,
                          const int nx, const bool calculate_norm, cudaStream_t stream) {
    constexpr int dim_block_x = 32;
    constexpr int dim_block_y = 32;
    dim3 dim_grid((nx + dim_block_x - 1) / dim_block_x,
                  ((iy_end - iy_start) + dim_block_y - 1) / dim_block_y, 1);
    jacobi_kernel<dim_block_x, dim_block_y><<<dim_grid, {dim_block_x, dim_block_y, 1}, 0, stream>>>(
        a_new, a, l2_norm, iy_start, iy_end, nx, calculate_norm);
    CUDA_RT_CALL(cudaGetLastError());
}

// The single_gpu interface is identical to the MPI implementation
double single_gpu(const int nx, const int ny, const int iter_max, real* const a_ref_h,
                  const int nccheck, const bool print);

// ==================================================
//          Argument Processing Helper Functions
// ==================================================
template <typename T>
T get_argval(char** begin, char** end, const std::string& arg, const T default_val) {
    T argval = default_val;
    char** itr = std::find(begin, end, arg);
    if (itr != end && ++itr != end) {
        std::istringstream inbuf(*itr);
        inbuf >> argval;
    }
    return argval;
}

bool get_arg(char** begin, char** end, const std::string& arg) {
    char** itr = std::find(begin, end, arg);
    if (itr != end) {
        return true;
    }
    return false;
}

// ==================================================
//                Main Program Entry
// ==================================================
int main(int argc, char* argv[]) {
    //
    // ===== Main Function Organization =====
    //
    // 1. MPI Initialization & Device Setup
    MPI_CALL(MPI_Init(&argc, &argv));
    int rank;
    MPI_CALL(MPI_Comm_rank(MPI_COMM_WORLD, &rank));
    int size;
    MPI_CALL(MPI_Comm_size(MPI_COMM_WORLD, &size));
    int num_devices = 0;
    CUDA_RT_CALL(cudaGetDeviceCount(&num_devices));

    // 2. Argument Parsing
    const int iter_max = get_argval<int>(argv, argv + argc, "-niter", 1000);
    const int nccheck = get_argval<int>(argv, argv + argc, "-nccheck", 1);
    const int nx = get_argval<int>(argv, argv + argc, "-nx", 16384);
    const int ny = get_argval<int>(argv, argv + argc, "-ny", 16384);
    const bool csv = get_arg(argv, argv + argc, "-csv");

    // 3. Establish Local Rank (for CUDA device binding)
    int local_rank = -1;
    {
        MPI_Comm local_comm;
        MPI_CALL(MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL,
                                     &local_comm));

        MPI_CALL(MPI_Comm_rank(local_comm, &local_rank));

        MPI_CALL(MPI_Comm_free(&local_comm));
    }

    CUDA_RT_CALL(cudaSetDevice(local_rank%num_devices));
    CUDA_RT_CALL(cudaFree(0));

    //TODO: Initialize NVSHMEM using nvshmemx_init_attr
    // 4. NVSHMEM Initialization
    MPI_Comm mpi_comm = MPI_COMM_WORLD;
    nvshmemx_init_attr_t attr;
    attr.mpi_comm = &mpi_comm;
    nvshmemx_init_attr(NVSHMEMX_INIT_WITH_MPI_COMM, &attr);

    assert( size == nvshmem_n_pes() );
    assert( rank == nvshmem_my_pe() );
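    // With NVSHMEMX_INIT_WITH_MPI_COMM on MPI_COMM_WORLD, the NVSHMEM PE count and
    // numbering match the MPI world size and rank, which the two asserts above check.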

    // 5. Host-side Data Allocation & Serial Reference Run
    real* a_ref_h;
    CUDA_RT_CALL(cudaMallocHost(&a_ref_h, nx * ny * sizeof(real)));
    real* a_h;
    CUDA_RT_CALL(cudaMallocHost(&a_h, nx * ny * sizeof(real)));
    double runtime_serial = single_gpu(nx, ny, iter_max, a_ref_h, nccheck, !csv && (0 == rank));

    // ny - 2 rows are distributed amongst `size` ranks in such a way
    // that each rank gets either (ny - 2) / size or (ny - 2) / size + 1 rows.
    // This optimizes load balancing when (ny - 2) % size != 0
    // 6. Domain Decomposition
    int chunk_size;
    int chunk_size_low = (ny - 2) / size;
    int chunk_size_high = chunk_size_low + 1;
    // To calculate the number of ranks that need to compute an extra row,
    // the following formula is derived from this equation:
    // num_ranks_low * chunk_size_low + (size - num_ranks_low) * (chunk_size_low + 1) = ny - 2
    int num_ranks_low = size * chunk_size_low + size -
                        (ny - 2);  // Number of ranks with chunk_size = chunk_size_low
    if (rank < num_ranks_low)
        chunk_size = chunk_size_low;
    else
        chunk_size = chunk_size_high;

    //TODO: Allocate a and a_new from the NVSHMEM symmetric heap instead of using cudaMalloc
    //      Note: the allocation size must be the same on all PEs, but chunk_size might differ, so allocate for chunk_size_high!
    // 7. Symmetric Heap (NVSHMEM) Allocation for Device Arrays
    real* a = (real*) nvshmem_malloc(nx * (chunk_size_high + 2) * sizeof(real));
    real* a_new = (real*) nvshmem_malloc(nx * (chunk_size_high + 2) * sizeof(real));

    CUDA_RT_CALL(cudaMemset(a, 0, nx * (chunk_size + 2) * sizeof(real)));
    CUDA_RT_CALL(cudaMemset(a_new, 0, nx * (chunk_size + 2) * sizeof(real)));

    // Calculate local domain boundaries
    int iy_start_global;  // My start index in the global array
    if (rank < num_ranks_low) {
        iy_start_global = rank * chunk_size_low + 1;
    } else {
        iy_start_global =
            num_ranks_low * chunk_size_low + (rank - num_ranks_low) * chunk_size_high + 1;
    }
    int iy_end_global = iy_start_global + chunk_size - 1;  // My last index in the global array

    int iy_start = 1;
    int iy_end = iy_start + chunk_size;

    const int top = rank > 0 ? rank - 1 : (size - 1);
    const int bottom = (rank + 1) % size;

    //TODO: calculate halo/boundary row index of top and bottom neighbors
    const int iy_top_lower_boundary_idx = (top < num_ranks_low) ? (chunk_size_low + 1) : (chunk_size_high + 1);
    const int iy_bottom_upper_boundary_idx = 0;

    // Set Dirichlet boundary conditions on the left and right borders
    // 8. Set Dirichlet Boundary Conditions
    initialize_boundaries<<<(chunk_size + 2) / 128 + 1, 128>>>(a, a_new, PI, iy_start_global - 1, nx, (chunk_size + 2), ny);
    CUDA_RT_CALL(cudaGetLastError());
    CUDA_RT_CALL(cudaDeviceSynchronize());

    // 9. CUDA Streams & Events Initialization
    int leastPriority = 0;
    int greatestPriority = leastPriority;
    CUDA_RT_CALL(cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority));
    cudaStream_t compute_stream;
    CUDA_RT_CALL(cudaStreamCreateWithPriority(&compute_stream, cudaStreamDefault, leastPriority));
    cudaStream_t push_stream;
    CUDA_RT_CALL(
        cudaStreamCreateWithPriority(&push_stream, cudaStreamDefault, greatestPriority));

    cudaEvent_t push_prep_done;
    CUDA_RT_CALL(cudaEventCreateWithFlags(&push_prep_done, cudaEventDisableTiming));
    cudaEvent_t push_done;
    CUDA_RT_CALL(cudaEventCreateWithFlags(&push_done, cudaEventDisableTiming));
    cudaEvent_t reset_l2norm_done;
    CUDA_RT_CALL(cudaEventCreateWithFlags(&reset_l2norm_done, cudaEventDisableTiming));

//    cudaStream_t compute_stream;
//    CUDA_RT_CALL(cudaStreamCreate(&compute_stream));
//    cudaEvent_t compute_done;
//    CUDA_RT_CALL(cudaEventCreateWithFlags(&compute_done, cudaEventDisableTiming));

    // 10. L2 Norm Buffers (Device and Host)
    real* l2_norm_d;
    CUDA_RT_CALL(cudaMalloc(&l2_norm_d, sizeof(real)));
    real* l2_norm_h;
    CUDA_RT_CALL(cudaMallocHost(&l2_norm_h, sizeof(real)));

    CUDA_RT_CALL(cudaDeviceSynchronize());

    // 11. Print Problem Information (if not CSV report)
    if (!csv && 0 == rank) {
        printf(
            "Jacobi relaxation: %d iterations on %d x %d mesh with norm check "
            "every %d iterations\n",
            iter_max, ny, nx, nccheck);
    }

    // 12. Jacobi Iteration Main Loop
    int iter = 0;
    real l2_norm = 1.0;
    bool calculate_norm;  // boolean to store whether l2 norm will be calculated in
                          //   an iteration or not

    MPI_CALL(MPI_Barrier(MPI_COMM_WORLD));
    double start = MPI_Wtime();
    PUSH_RANGE("Jacobi solve", 0)
    while (l2_norm > tol && iter < iter_max) {
        CUDA_RT_CALL(cudaMemsetAsync(l2_norm_d, 0, sizeof(real), compute_stream));
        CUDA_RT_CALL(cudaEventRecord(reset_l2norm_done, compute_stream));

        CUDA_RT_CALL(cudaStreamWaitEvent(push_stream, reset_l2norm_done, 0));
        calculate_norm = (iter % nccheck) == 0 || (!csv && (iter % 100) == 0);

        launch_jacobi_kernel(a_new, a, l2_norm_d, (iy_start + 1), (iy_end - 1), nx, calculate_norm,
                             compute_stream);

        launch_jacobi_kernel(a_new, a, l2_norm_d, iy_start, (iy_start + 1), nx, calculate_norm,
                             push_stream);

        launch_jacobi_kernel(a_new, a, l2_norm_d, (iy_end - 1), iy_end, nx, calculate_norm,
                             push_stream);
        CUDA_RT_CALL(cudaEventRecord(push_prep_done, push_stream));

        if (calculate_norm) {
            CUDA_RT_CALL(cudaStreamWaitEvent(compute_stream, push_prep_done, 0));
            CUDA_RT_CALL(cudaMemcpyAsync(l2_norm_h, l2_norm_d, sizeof(real), cudaMemcpyDeviceToHost,
                                         compute_stream));
        }

        //TODO: Replace the MPI communication with host-initiated NVSHMEM calls
        // Apply periodic boundary conditions

        // ========================================================
        // NCCL vs. NVSHMEM Communication
        // ========================================================
        //
        //   NCCL (Collective Operations: e.g., AllReduce, host-initiated)
        //   -------------------------------------------------------------
        //   [CPU/host launches collective, e.g., ncclAllReduce. GPUs cooperate;
        //    communication is structured (ring/tree) and not from within device code.]
        //
        //    +--------+     +--------+     +--------+     +--------+
        //    | GPU 0  |<--->| GPU 1  |<--->| GPU 2  |<--->| GPU 3  |
        //    +--------+     +--------+     +--------+     +--------+
        //        ^                                          |
        //        +------------------------------------------+
        //      (All participate in global pattern, CPU initiates)
        //
        //
        //   NVSHMEM (PGAS, device-initiated, one-sided)
        //   -------------------------------------------
        //   [Any CUDA kernel thread can communicate with any peer, device-to-device.
        //    Halo exchange in stencil: each GPU puts its boundary rows directly
        //    to neighbors, possibly inside the kernel, fully asynchronous and fine-grained.]
        //
        //    +--------+             +--------+             +--------+             +--------+
        //    | GPU 0  |<~ ~ ~ ~ ~ ~>| GPU 1  |<~ ~ ~ ~ ~ ~>| GPU 2  |<~ ~ ~ ~ ~ ~>| GPU 3  |
        //    +--------+             +--------+             +--------+             +--------+
        //    |     ^   \           /    ^    \           /    ^    \           /    ^    |
        //    v     |    \         /     |     \         /     |     \         /     |    v
        //  (put/get)    <-------                <-------              <-------    
        //    Any-to-any, any pattern, any time (no CPU needed, initiated within device code)
        //
        //   In this code's context: each GPU "puts" its boundary rows directly into its
        //   top/bottom neighbor's halo using nvshmemx_float_put_on_stream. The call is
        //   host-initiated but stream-ordered, so no host staging or MPI message matching
        //   sits in the data path.
        // ========================================================
        //
        // NVSHMEM Global Symmetric Memory Layout Example:
        // -----------------------------------------------
        // Suppose each rank allocates a_new[] and a[] as:
        //    real* a_new = (real*) nvshmem_malloc(chunk_size * nx * sizeof(real));
        // The symmetric heap on ALL GPUs (PEs) then looks like:
        //
        //   Symmetric Heap [NVSHMEM] across PEs:
        //
        //    PE 0:  a_new [0 ... chunk_size*nx-1], a [0 ... chunk_size*nx-1]
        //    PE 1:  a_new [0 ... chunk_size*nx-1], a [0 ... chunk_size*nx-1]
        //    PE 2:  a_new [0 ... chunk_size*nx-1], a [0 ... chunk_size*nx-1]
        //    PE 3:  ...
        //
        //  Each PE can access the other's symmetric memory using NVSHMEM puts/gets:
        //    // On PE X:
        //    nvshmemx_float_put_on_stream(remote_addr_on_PEY, local_addr_on_PEX, ..., PEY, ...);
        //
        //  For halo exchange: we copy our top or bottom row directly into the appropriate
        //  location of our top or bottom neighbor's symmetric a_new[] buffer.
        //
        //  Example showing local arrays and halo regions (1D for clarity):
        //
        //   |<--- owned chunk --->|
        //   +---------------------+
        //   |halo|interior|halo   |
        //   +----+--------+------+
        //        |        |
        //      [iy_start] [iy_end-1]
        //
        //   - We "put" our [iy_start] row to top neighbor PE's halo,
        //     and [iy_end-1] row to bottom neighbor PE's halo.
        //
        // ========================================================

        PUSH_RANGE("NVSHMEM", 5)
        // Send top boundary row to top neighbor
        nvshmemx_float_put_on_stream(
            a_new + iy_top_lower_boundary_idx * nx,        // Destination in neighbor
            a_new + iy_start * nx,                         // Source (my top interior row)
            nx,
            top,                                           // Neighbor PE id
            push_stream);

        // Send bottom boundary row to bottom neighbor
        nvshmemx_float_put_on_stream(
            a_new + iy_bottom_upper_boundary_idx * nx,     // Destination in neighbor
            a_new + (iy_end - 1) * nx,                     // Source (my bottom interior row)
            nx,
            bottom,                                        // Neighbor PE id
            push_stream);
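        // Note: the typed float put matches the default real == float build; a
        // -DUSE_DOUBLE build would need nvshmemx_double_put_on_stream (or the
        // untyped nvshmemx_putmem_on_stream with a byte count) instead.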

        CUDA_RT_CALL(cudaEventRecord(push_done, push_stream));
        POP_RANGE

        // Ensure peer data is sent before proceeding with computation on next iteration
        CUDA_RT_CALL(cudaStreamWaitEvent(compute_stream, push_done, 0));

        // Synchronize all PEs (GPUs) at device level to ensure halo consistency
        // across all neighbors, before next Jacobi step.
        nvshmemx_barrier_all_on_stream(compute_stream);

        if (calculate_norm) {
            CUDA_RT_CALL(cudaStreamSynchronize(compute_stream));
            // Still using MPI_Allreduce for l2_norm. (Could be ported to NVSHMEM reduction.)
            MPI_CALL(MPI_Allreduce(l2_norm_h, &l2_norm, 1, MPI_REAL_TYPE, MPI_SUM, MPI_COMM_WORLD));
            l2_norm = std::sqrt(l2_norm);

            if (!csv && 0 == rank && (iter % 100) == 0) {
                printf("%5d, %0.6f\n", iter, l2_norm);
            }
        }
        std::swap(a_new, a);
        iter++;
    }
    CUDA_RT_CALL(cudaDeviceSynchronize());
    double stop = MPI_Wtime();
    POP_RANGE

    CUDA_RT_CALL(cudaMemcpy(a_h + iy_start_global * nx, a + nx,
                            std::min((ny - iy_start_global) * nx, chunk_size * nx) * sizeof(real),
                            cudaMemcpyDeviceToHost));

    int result_correct = 1;
    for (int iy = iy_start_global; result_correct && (iy < iy_end_global); ++iy) {
        for (int ix = 1; result_correct && (ix < (nx - 1)); ++ix) {
            if (std::fabs(a_ref_h[iy * nx + ix] - a_h[iy * nx + ix]) > tol) {
                fprintf(stderr,
                        "ERROR on rank %d: a[%d * %d + %d] = %f does not match %f "
                        "(reference)\n",
                        rank, iy, nx, ix, a_h[iy * nx + ix], a_ref_h[iy * nx + ix]);
                result_correct = 0;
            }
        }
    }

    int global_result_correct = 1;
    MPI_CALL(MPI_Allreduce(&result_correct, &global_result_correct, 1, MPI_INT, MPI_MIN,
                           MPI_COMM_WORLD));
    result_correct = global_result_correct;

    if (rank == 0 && result_correct) {
        if (csv) {
            //TODO: Don't forget to change your output label to nvshmem
            printf("nvshmem, %d, %d, %d, %d, %d, 1, %f, %f\n", nx, ny, iter_max, nccheck, size,
                   (stop - start), runtime_serial);
        } else {
            printf("Num GPUs: %d.\n", size);
            printf(
                "%dx%d: 1 GPU: %8.4f s, %d GPUs: %8.4f s, speedup: %8.2f, "
                "efficiency: %8.2f \n",
                ny, nx, runtime_serial, size, (stop - start), runtime_serial / (stop - start),
                runtime_serial / (size * (stop - start)) * 100);
        }
    }
    CUDA_RT_CALL(cudaEventDestroy(reset_l2norm_done));
    CUDA_RT_CALL(cudaEventDestroy(push_done));
    CUDA_RT_CALL(cudaEventDestroy(push_prep_done));
    CUDA_RT_CALL(cudaStreamDestroy(push_stream));
    CUDA_RT_CALL(cudaStreamDestroy(compute_stream));

    CUDA_RT_CALL(cudaFreeHost(l2_norm_h));
    CUDA_RT_CALL(cudaFree(l2_norm_d));

    //TODO: Deallocate a_new and a from the NVSHMEM symmetric heap instead of using cudaFree()
    nvshmem_free(a_new);
    nvshmem_free(a);

    CUDA_RT_CALL(cudaFreeHost(a_h));
    CUDA_RT_CALL(cudaFreeHost(a_ref_h));

    //TODO: Finalize NVSHMEM
    nvshmem_finalize();

    MPI_CALL(MPI_Finalize());
    return (result_correct == 1) ? 0 : 1;
}

// ==================================================
//        Single-GPU Reference Kernel (for check)
// ==================================================
template <int BLOCK_DIM_X, int BLOCK_DIM_Y>
__global__ void jacobi_kernel_single_gpu(real* __restrict__ const a_new, const real* __restrict__ const a,
                              real* __restrict__ const l2_norm, const int iy_start,
                              const int iy_end, const int nx, const bool calculate_norm) {
#ifdef HAVE_CUB
    typedef cub::BlockReduce<real, BLOCK_DIM_X, cub::BLOCK_REDUCE_WARP_REDUCTIONS, BLOCK_DIM_Y>
        BlockReduce;
    __shared__ typename BlockReduce::TempStorage temp_storage;
#endif  // HAVE_CUB
    int iy = blockIdx.y * blockDim.y + threadIdx.y + iy_start;
    int ix = blockIdx.x * blockDim.x + threadIdx.x + 1;
    real local_l2_norm = 0.0;

    if (iy < iy_end && ix < (nx - 1)) {
        const real new_val = 0.25 * (a[iy * nx + ix + 1] + a[iy * nx + ix - 1] +
                                     a[(iy + 1) * nx + ix] + a[(iy - 1) * nx + ix]);
        a_new[iy * nx + ix] = new_val;
        if (calculate_norm) {
            real residue = new_val - a[iy * nx + ix];
            local_l2_norm += residue * residue;
        }
    }
    if (calculate_norm) {
#ifdef HAVE_CUB
        real block_l2_norm = BlockReduce(temp_storage).Sum(local_l2_norm);
        if (0 == threadIdx.y && 0 == threadIdx.x) atomicAdd(l2_norm, block_l2_norm);
#else
        atomicAdd(l2_norm, local_l2_norm);
#endif  // HAVE_CUB
    }
}

// The single_gpu interface is identical to the MPI implementation
double single_gpu(const int nx, const int ny, const int iter_max, real* const a_ref_h,
                  const int nccheck, const bool print);

[wood2@jwlogin24 NVSHMEM]$ make
nvcc -DHAVE_CUB -dc -Xcompiler -fopenmp -lineinfo -DUSE_NVTX -lnvToolsExt \
    -gencode arch=compute_80,code=sm_80 \
    -gencode arch=compute_80,code=compute_80 \
    -std=c++14 \
    -I/p/software/default/stages/2025/software/NVSHMEM/3.1.7-gompi-2024a-CUDA-12/include \
    -I/p/software/default/stages/2025/software/OpenMPI/5.0.5-GCC-13.3.0/include \
    jacobi.cu -c -o jacobi.o

nvcc \
    -gencode arch=compute_80,code=sm_80 \
    -gencode arch=compute_80,code=compute_80 \
    jacobi.o -o jacobi -ccbin=mpic++ \
    -L/p/software/default/stages/2025/software/NVSHMEM/3.1.7-gompi-2024a-CUDA-12/lib -lnvshmem \
    -L/p/software/default/stages/2025/software/OpenMPI/5.0.5-GCC-13.3.0/lib -lmpi \
    -L/p/software/juwelsbooster/stages/2025/software/CUDA/12/lib64 -lcuda -lcudart -lnvToolsExt -lnvidia-ml
[wood2@jwlogin24 NVSHMEM]$ make run
CUDA_VISIBLE_DEVICES=0,1,2,3 \
salloc --reservation sc25-multigpu-tut \
       --partition=booster \
       --gres=gpu:4 \
       --disable-dcgm \
       --time=0:10:00 \
       srun --cpu-bind=sockets --unbuffered -n 4 ./jacobi
salloc: Granted job allocation 12864158
salloc: Waiting for resource configuration
salloc: Nodes jwb0163 are ready for job
Single GPU jacobi relaxation: 1000 iterations on 16384 x 16384 mesh with norm check every 1 iterations
    0, 31.999022
  100, 0.897983
  200, 0.535684
  300, 0.395651
  400, 0.319039
  500, 0.269961
  600, 0.235509
  700, 0.209829
  800, 0.189854
  900, 0.173818
Jacobi relaxation: 1000 iterations on 16384 x 16384 mesh with norm check every 1 iterations
    0, 31.999022
  100, 0.897983
  200, 0.535685
  300, 0.395651
  400, 0.319039
  500, 0.269961
  600, 0.235510
  700, 0.209829
  800, 0.189854
  900, 0.173818
Num GPUs: 4.
16384x16384: 1 GPU:   3.2698 s, 4 GPUs:   0.8556 s, speedup:     3.82, efficiency:    95.54 
[wood2@jwlogin24 NVSHMEM]$ salloc --reservation=sc25-multigpu-tut \
       --partition=booster \
       --nodes=4 \
       --gres=gpu:4 \
       --disable-dcgm \
       --time=0:10:00 \
       srun --cpu-bind=sockets --unbuffered -n 16 ./jacobi
salloc: Granted job allocation 12864181
salloc: Waiting for resource configuration
salloc: Nodes jwb[0066,0091,0103,0127] are ready for job
Single GPU jacobi relaxation: 1000 iterations on 16384 x 16384 mesh with norm check every 1 iterations
    0, 31.999022
  100, 0.897983
  200, 0.535684
  300, 0.395651
  400, 0.319039
  500, 0.269961
  600, 0.235509
  700, 0.209829
  800, 0.189854
  900, 0.173818
Jacobi relaxation: 1000 iterations on 16384 x 16384 mesh with norm check every 1 iterations
    0, 31.999023
  100, 0.897983
  200, 0.535685
  300, 0.395651
  400, 0.319039
  500, 0.269961
  600, 0.235510
  700, 0.209829
  800, 0.189854
  900, 0.173818
Num GPUs: 16.
16384x16384: 1 GPU:   3.2878 s, 16 GPUs:   0.2885 s, speedup:    11.40, efficiency:    71.23 
salloc: Relinquishing job allocation 12864181
salloc: Job allocation 12864181 has been revoked.
Feature                     | MPI                       | NCCL                       | NVSHMEM
----------------------------|---------------------------|----------------------------|---------------------------
Main model                  | Message passing           | Collectives                | One-sided PGAS
GPU optimization            | Medium                    | Very high                  | Very high
Communication granularity   | Coarse and fine           | Coarse (collectives only)  | Fine (put/get)
GPU-initiated communication | Limited                   | No                         | Yes
Collective ops              | Rich but slower on GPUs   | Best performing            | Available but less common
Point-to-point messaging    | Yes                       | No                         | Yes (one-sided)
Best for                    | General HPC, CPU/GPU apps | Deep learning              | GPU-centric irregular HPC
