Parallel Computing#
Modern scientific computing tasks often involve massive datasets and computationally expensive algorithms. Problems like large-scale simulations, statistical sampling (e.g., Monte Carlo methods), and real-time data processing demand a level of performance that cannot be achieved with sequential execution alone.
Parallel computing is the practice of dividing a problem into smaller subproblems that can be solved simultaneously. With the rise of multicore processors, distributed systems, and GPUs, parallel computing is now essential for high-performance computing (HPC).
This lecture introduces key ideas, theoretical limits, and practical tools for parallel computation.
Theoretical Foundations#
Before we explore specific tools and implementations, it’s important to understand the theoretical limits of parallel computing. These foundational concepts help us answer questions like:
What is the maximum possible speedup if we parallelize a task?
Where should we invest our effort to gain performance?
Why do some problems benefit more from parallelization than others?
In addition, scaling analyses provide practical ways to assess real-world performance of parallel codes.
Amdahl’s Law#
Let $f$ be the fraction of the program that must run serially and $P$ be the number of processors. Amdahl's law bounds the speedup by

$$S(P) = \frac{1}{f + \dfrac{1-f}{P}}.$$

As $P \to \infty$, the speedup approaches $1/f$.
Implication: Even small sequential portions limit total speedup.
Amdahl’s law illustrates that optimizing the serial part of a program can be more impactful than parallelizing the rest. For example, if 5% of the computation is inherently sequential, then no matter how many processors are used, we cannot speed up the program by more than 20x. This is especially important when designing algorithms for leadership-class HPC systems (e.g., DOE Frontier). It corresponds to “strong scaling tests”.
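As a quick illustration (not part of the original lecture code), here is a minimal Python sketch that evaluates Amdahl's speedup for a 5% serial fraction:

def amdahl_speedup(f, P):
    # S(P) = 1 / (f + (1 - f) / P), with serial fraction f and P processors
    return 1.0 / (f + (1.0 - f) / P)

for P in [1, 2, 4, 16, 64, 1024]:
    print(f"P = {P:5d}: speedup = {amdahl_speedup(0.05, P):.2f}")

No matter how large $P$ grows, the speedup saturates near $1/f = 20$.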
Gustafson’s Law#
Recognizes that in order to fully utilize computing resources, problem size often needs to scale with the number of processors:
$$S(P) = f + (1 - f)\,P,$$

where $f$ is again the serial fraction and $P$ the number of processors. It assumes the workload grows with $P$ while the time spent in the serial part stays fixed.
Implication: In practice, we often scale up problems as we add more resources. This law gives a more optimistic, and often more realistic, view for scientific computing, where we typically increase the resolution or domain size as more computing power becomes available. It corresponds to “weak scaling tests”.
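For comparison, a minimal sketch (again, not part of the lecture code) of Gustafson's scaled speedup with the same 5% serial fraction:

def gustafson_speedup(f, P):
    # S(P) = f + (1 - f) * P, with serial fraction f and P processors
    return f + (1.0 - f) * P

for P in [1, 2, 4, 16, 64, 1024]:
    print(f"P = {P:5d}: scaled speedup = {gustafson_speedup(0.05, P):.2f}")

Because the problem size grows with $P$, the scaled speedup keeps increasing nearly linearly instead of saturating.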
Flynn’s Taxonomy#
A classification of computer architectures:
SISD: Single Instruction Single Data (standard CPU)
SIMD: Single Instruction Multiple Data (vector processors, GPUs)
MISD: Multiple Instruction Single Data (rare, mostly theoretical)
MIMD: Multiple Instruction Multiple Data (clusters, multicore CPUs)
Flynn’s taxonomy helps us map programming models to the underlying hardware. For instance, OpenMP typically targets MIMD systems with shared memory, while SIMD models underpin GPUs and vectorized CPU instructions.
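As a rough illustration of the SIMD idea at the programming-model level (a sketch, not from the lecture code), compare an explicit element-by-element loop with a vectorized NumPy expression, which lets the library apply one operation to many data elements at once and may use SIMD instructions under the hood:

import numpy as np

a = np.arange(100_000, dtype=np.float64)
b = np.ones(100_000)

# scalar style: one element at a time
c_loop = np.empty_like(a)
for i in range(a.size):
    c_loop[i] = a[i] + b[i]

# "single instruction, multiple data" in spirit: one vectorized expression
c_vec = a + b

assert np.allclose(c_loop, c_vec)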
HPC Accounts#
We may be able to run some code on UA HPC. Let’s gather everyone’s NetID in this Google Spreadsheet.
Monte Carlo Computation of π#
We will parallelize a simple algorithm using different techniques.
The first algorithm is the Monte Carlo computation of $\pi$: we sample points uniformly in the unit square and count the fraction that falls inside the unit quarter circle. Since that fraction converges to $\pi/4$, multiplying by 4 gives an estimate of $\pi$.
Python Series Code#
Here is the algorithm in native Python:
import random
def mcpi_loop(n_total):
    # count random points in the unit square that land inside the unit quarter circle
    n_inside = 0
    for _ in range(n_total):
        x, y = random.random(), random.random()
        if x*x + y*y < 1.0:
            n_inside += 1
    return n_inside
pi = 4 * mcpi_loop(1000_000) / 1000_000
print(pi)
3.14368
%timeit mcpi_loop(1000_000)
160 ms ± 237 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
On my laptop it takes about 160 ms to perform 1M samples.
The number of significant digits is only about $\log_{10}\sqrt{n_\mathrm{total}} = 3$, since the statistical error of a Monte Carlo estimate scales as $1/\sqrt{n_\mathrm{total}}$.
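A quick numerical check of this scaling, reusing mcpi_loop from above (a sketch; the exact errors fluctuate from run to run):

import math

# the Monte Carlo error should shrink roughly like 1 / sqrt(n_total)
for n in [10_000, 100_000, 1_000_000]:
    estimate = 4 * mcpi_loop(n) / n
    print(f"n = {n:9,d}: pi ~ {estimate:.5f}, "
          f"error ~ {abs(estimate - math.pi):.5f}, "
          f"1/sqrt(n) = {1/math.sqrt(n):.5f}")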
Embarrassingly Parallel Computing#
Since this algorithm is embarrassingly parallelizable, we can simply run it multiple times and compute the mean. Let’s do this as a class exercise using this Google Sheet.
Effectively, we just did a weak scaling test!
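The same idea can also be scripted with Python’s built-in multiprocessing module. Below is a hedged sketch (not the class exercise itself, and assuming a Unix-style fork start method) that runs mcpi_loop in several worker processes and combines the counts:

import random
from multiprocessing import Pool

def mcpi_parallel(n_total, n_workers=4):
    # split the samples across workers; re-seed each worker so forked
    # processes do not reuse the parent's random state
    chunk = n_total // n_workers
    with Pool(processes=n_workers, initializer=random.seed) as pool:
        counts = pool.map(mcpi_loop, [chunk] * n_workers)
    return 4 * sum(counts) / (chunk * n_workers)

print(mcpi_parallel(1_000_000))

Keeping the number of samples per worker fixed while adding workers, instead of splitting a fixed total, would turn this into a weak scaling test.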
Numpy Parallel Code#
When compiled with a BLAS backend, NumPy can automatically distribute computation across multiple cores.
import numpy as np
def mcpi_numpy(n_total):
    # vectorized version: draw all samples at once and count the hits
    x = np.random.rand(n_total)
    y = np.random.rand(n_total)
    return np.sum(x*x + y*y < 1.0)
pi = 4 * mcpi_numpy(1000_000) / 1000_000
print(pi)
3.142024
%timeit mcpi_numpy(1000_000)
18.6 ms ± 84.8 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
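If you want to check which BLAS/threading backend your NumPy build is linked against, the third-party threadpoolctl package can report it (an assumption: it is not part of the lecture materials and may need to be installed):

from threadpoolctl import threadpool_info

# each entry describes one thread pool NumPy can use, e.g. OpenBLAS or MKL
for pool in threadpool_info():
    print(pool.get("internal_api"), "using", pool.get("num_threads"), "threads")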
JAX Parallel Code#
Many operations in JAX, especially those related to linear algebra, automatically use multiple cores.
from jax import numpy as jnp
from jax import random as jrd
def mcpi_jax(n_total):
    # JAX requires explicit PRNG keys; split one key into two independent streams
    key = jrd.key(0)
    key1, key2 = jrd.split(key)
    x = jrd.uniform(key1, n_total)
    y = jrd.uniform(key2, n_total)
    return jnp.sum(x*x + y*y < 1.0)
pi = 4 * mcpi_jax(1000_000) / 1000_000
print(pi)
3.142476
%timeit mcpi_jax(1000_000)
7.37 ms ± 68 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
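JAX can additionally JIT-compile the whole estimator, and because JAX dispatches work asynchronously, a fair timing should call block_until_ready(). Here is a hedged sketch of such a variant (n_total is marked static since it determines array shapes); the exact timing will depend on your machine:

from functools import partial
from jax import jit

@partial(jit, static_argnums=0)
def mcpi_jax_jit(n_total):
    key = jrd.key(0)
    key1, key2 = jrd.split(key)
    x = jrd.uniform(key1, n_total)
    y = jrd.uniform(key2, n_total)
    return jnp.sum(x*x + y*y < 1.0)

# block_until_ready() ensures the computation has finished before timing stops
%timeit mcpi_jax_jit(1000_000).block_until_ready()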
C Series Code#
Our original Python code can be easily translated to C. Please put the following code into a new file mcpi_loop.c:
#include <stdlib.h>
int mcpi_loop(int n_total)
{
    int n_inside = 0;
    for(int i = 0; i < n_total; ++i) {
        /* rand()/RAND_MAX gives a uniform sample in [0, 1] */
        double x = (double)rand() / RAND_MAX;
        double y = (double)rand() / RAND_MAX;
        if(x * x + y * y < 1.0)
            ++n_inside;
    }
    return n_inside;
}
Although the actual C code is pretty simple, there is unfortunately no portable, high-resolution timing function in bash.
Hence, we need to do the timing ourselves in the C code:
#include <time.h>
static struct timespec t0;

void tik()
{
    clock_gettime(CLOCK_REALTIME, &t0); /* record the start time */
}

double tok()
{
    struct timespec t1, dt;
    clock_gettime(CLOCK_REALTIME, &t1);

    /* compute the elapsed time t1 - t0, borrowing a second if needed */
    dt.tv_nsec = t1.tv_nsec - t0.tv_nsec;
    dt.tv_sec  = t1.tv_sec  - t0.tv_sec;
    if(dt.tv_nsec < 0) {
        dt.tv_nsec += 1000000000;
        dt.tv_sec--;
    }

    /* return the elapsed time in milliseconds */
    int ms = dt.tv_nsec / 1000000 + dt.tv_sec * 1000;
    int ns = dt.tv_nsec % 1000000;
    return ms + 1e-6 * ns;
}
Then we can put it all together and create the main function:
#include <stdio.h>
int main()
{
    tik();
    double pi = 4.0 * mcpi_loop(1000000) / 1000000;
    double ms = tok();
    printf("pi = %f\n", pi);
    printf("dt = %f ms\n", ms);
    return 0;
}
We can put all these functions into a single “mcpi_loop.c” file and then compile with:
gcc mcpi_loop.c -O3 -o mcpi_loop
./mcpi_loop
On my laptop, the run takes ~ 36 ms.
Distributed Memory: MPI#
MPI (Message Passing Interface) is the de facto standard for distributed-memory parallel programming. Unlike OpenMP, where threads share memory, MPI processes have separate memory spaces and must explicitly communicate using messages.
MPI is well-suited for large-scale clusters and supercomputers, where each node has its own memory and processors.
Common MPI functions include:
MPI_Init() and MPI_Finalize() for starting and ending MPI programs
MPI_Comm_rank() and MPI_Comm_size() to identify each process and the total number of processes
MPI_Send(), MPI_Recv(), and MPI_Reduce() for data exchange and aggregation
Important note: One of the most commonly used implementations of MPI is OpenMPI. Despite the similar name, OpenMPI is unrelated to OpenMP. The name similarity is unfortunately confusing!
OpenMP is for multithreading on shared-memory systems
OpenMPI is a software implementation of the MPI standard for distributed-memory systems
To compile and run an MPI program, use mpicc to compile the program and use mpirun -np N ./program to run it with N processes.
Here is an example implementation:
#include <mpi.h>
/* assumes mcpi_loop(), tik(), and tok() from above are compiled into the same program */
int main(int argc, char** argv)
{
    tik();

    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int n_total = 1000000;
    int l_total = n_total / size; /* samples handled by this process */

    srand(time(NULL)+rank); // ensure different seed per process
    int l_inside = mcpi_loop(l_total);

    /* sum the local counts from all processes into rank 0 */
    int n_inside = 0;
    MPI_Reduce(&l_inside, &n_inside, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if(rank == 0) {
        double pi = 4.0 * n_inside / n_total;
        printf("Estimated value of pi: %f\n", pi);
    }

    MPI_Finalize();

    double ms = tok();
    if(rank == 0)
        printf("dt = %f ms\n", ms);

    return 0;
}
On my laptop, when I run mpirun -np 4 ./mcpi_mpi, it takes ~ 25 ms.
This is significantly higher than the OpenMP version because of all the startup overhead.
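For readers who prefer to stay in Python, the same pattern can be expressed with the mpi4py bindings. This is a hedged sketch, not part of the lecture’s C code, and assumes mpi4py is installed (the file name is hypothetical):

# mcpi_mpi.py: run with, e.g., mpirun -np 4 python mcpi_mpi.py
import random
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n_total = 1_000_000
l_total = n_total // size      # samples handled by this process

random.seed(rank)              # different seed per process
l_inside = sum(1 for _ in range(l_total)
               if random.random()**2 + random.random()**2 < 1.0)

# sum the local counts on rank 0, mirroring MPI_Reduce in the C version
n_inside = comm.reduce(l_inside, op=MPI.SUM, root=0)
if rank == 0:
    print("Estimated value of pi:", 4 * n_inside / n_total)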
Job Submission to HPC with Slurm#
On many HPC clusters, job execution is managed by a workload manager. One of the most widely used systems is Slurm (Simple Linux Utility for Resource Management).
Slurm schedules jobs across compute nodes and handles resource allocation, job queues, and execution environments. Instead of running MPI jobs interactively with mpirun, users typically submit jobs using a Slurm script.
The UA HPC website has some concise documentation on using Slurm.
And here is a sample submission script “submit.sh” that we may use:
#!/bin/bash
#SBATCH --job-name=mcpi_mpi
#SBATCH --ntasks=4
#SBATCH --nodes=1
#SBATCH --mem=1gb
#SBATCH --time=00:05:00
#SBATCH --partition=standard
#SBATCH --account=2025q1-phys305
# SLURM Inherits your environment. cd $SLURM_SUBMIT_DIR not needed
pwd; hostname; date
module load openmpi3
/usr/bin/time -o mcpi_mpi.time mpirun -np 4 ./mcpi_mpi
To submit the job, simply run
sbatch submit.sh
To check your job, run
squeue -u USERNAME