Parallel Computing#

Modern scientific computing tasks often involve massive datasets and computationally expensive algorithms. Problems like large-scale simulations, statistical sampling (e.g., Monte Carlo methods), and real-time data processing demand a level of performance that cannot be achieved with sequential execution alone.

Parallel computing is the practice of dividing a problem into smaller subproblems that can be solved simultaneously. With the rise of multicore processors, distributed systems, and GPUs, parallel computing is now essential for high-performance computing (HPC).

This lecture introduces key ideas, theoretical limits, and practical tools for parallel computation.

Theoretical Foundations#

Before we explore specific tools and implementations, it’s important to understand the theoretical limits of parallel computing. These foundational concepts help us answer questions like:

  • What is the maximum possible speedup if we parallelize a task?

  • Where should we invest our effort to gain performance?

  • Why do some problems benefit more from parallelization than others?

In addition, scaling analyses provide practical ways to assess real-world performance of parallel codes.

Amdahl’s Law#

Let f be the fraction of a program that must be executed sequentially. The maximum speed-up S obtainable with P processors is:

(475)   S(P) = 1 / (f + (1 - f)/P)

As P → ∞, S → 1/f.

Implication: Even small sequential portions limit total speedup.

Amdahl’s law illustrates that optimizing the serial part of a program can be more impactful than parallelizing the rest. For example, if 5% of the computation is inherently sequential, then no matter how many processors are used, we cannot speed up the program by more than 20x. This is especially important when one designs algorithms for leadership-class HPC systems (e.g., DOE Frontier). It corresponds to “strong scaling” tests.
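
To make the numbers concrete, here is a small Python sketch (an illustrative addition, not part of the lecture code) that evaluates Amdahl’s formula for f = 0.05:

def amdahl_speedup(f, P):
    """Theoretical speedup with sequential fraction f on P processors."""
    return 1.0 / (f + (1.0 - f) / P)

for P in [1, 2, 4, 8, 16, 64, 256, 1024]:
    print(f"P = {P:5d}: S = {amdahl_speedup(0.05, P):6.2f}")
# The speedup saturates near 1/f = 20 no matter how large P becomes.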

Gustafson’s Law#

Gustafson’s law recognizes that, in order to fully utilize computing resources, the problem size often needs to scale with the number of processors:

(476)   S(P) = P - f (P - 1)

Here f is again the sequential fraction, now measured on the scaled problem. By assuming the workload grows with P, Gustafson’s law avoids Amdahl’s pessimism.

Implication: In practice, we often scale up problems as we add more resources. This law gives a more optimistic and realistic view in scientific computing, where we often increase the resolution or domain size with more computing power. It corresponds to “weak scaling tests”.
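
For comparison with the Amdahl sketch above, the scaled speedup for the same f = 0.05 keeps growing with P (again, an illustrative addition):

def gustafson_speedup(f, P):
    """Scaled speedup with sequential fraction f on P processors."""
    return P - f * (P - 1)

for P in [1, 2, 4, 8, 16, 64, 256, 1024]:
    print(f"P = {P:5d}: S = {gustafson_speedup(0.05, P):8.2f}")
# The scaled speedup grows roughly linearly in P instead of saturating.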

Flynn’s Taxonomy#

A classification of computer architectures:

  • SISD: Single Instruction Single Data (standard CPU)

  • SIMD: Single Instruction Multiple Data (vector processors, GPUs)

  • MISD: Multiple Instruction Single Data (rare, mostly theoretical)

  • MIMD: Multiple Instruction Multiple Data (clusters, multicore CPUs)

Flynn’s taxonomy helps us map programming models to the underlying hardware. For instance, OpenMP typically targets MIMD systems with shared memory, while SIMD models underpin GPUs and vectorized CPU instructions.

HPC Accounts#

We may be able to run some code on UA HPC. Let’s gather everyone’s NetID on this Google Spreadsheet.

Monte Carlo Computation of π#

We will parallelize a simple algorithm using different techniques. The first algorithm is the Monte Carlo computation of π. This is an embarrassingly parallel problem, so not much algorithmic consideration is needed. We mainly use it to get ourselves familiar with the different tools.

Python Serial Code#

Here is the algorithm in native Python:

import random

def mcpi_loop(n_total):
    n_inside = 0
    for _ in range(n_total):
        x, y = random.random(), random.random()
        if x*x + y*y < 1.0:
            n_inside += 1
    return n_inside
pi = 4 * mcpi_loop(1000_000) / 1000_000
print(pi)
3.14368
%timeit mcpi_loop(1000_000)
160 ms ± 237 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

On my laptop it takes about 80 ms to perform 1M samples. The number of significant digits is log10(√N) = 3.
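
The 1/√N error scaling can be checked directly with the mcpi_loop() defined above (a quick illustrative addition):

import math

for n in [10**3, 10**4, 10**5, 10**6]:
    err = abs(4 * mcpi_loop(n) / n - math.pi)
    print(f"N = {n:8d}: error = {err:.2e}, 1/sqrt(N) = {1/math.sqrt(n):.2e}")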

Embarrassingly Parallel Computing#

Since this algorithm is embarrassingly parallelizable, we can simply run it multiple times and compute the mean. Let’s do this as a class exercise using this Google Sheet.

Effectively, we just did a weak scaling test!
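
The same idea can also be automated on a single machine with Python’s multiprocessing module. The sketch below (an illustrative addition, assuming the mcpi_loop() defined above is importable) runs independent batches in separate processes, seeds each one differently, and combines the counts:

import random
from multiprocessing import Pool

def worker(args):
    seed, n = args
    random.seed(seed)          # give each worker process its own random stream
    return mcpi_loop(n)

def mcpi_parallel(n_total, n_workers=4):
    batch = n_total // n_workers
    with Pool(n_workers) as pool:
        counts = pool.map(worker, [(i, batch) for i in range(n_workers)])
    return 4.0 * sum(counts) / (batch * n_workers)

if __name__ == "__main__":
    print(mcpi_parallel(1_000_000))

Note that each process needs its own seed; otherwise workers forked from the same parent may generate identical samples.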

Numpy Parallel Code#

When built against a multithreaded backend (e.g., a parallel BLAS), NumPy can automatically distribute compute across multiple cores; its vectorized operations also avoid the per-iteration overhead of the Python loop.

import numpy as np

def mcpi_numpy(n_total):
    x = np.random.rand(n_total)
    y = np.random.rand(n_total)
    return np.sum(x*x + y*y < 1.0)
pi = 4 * mcpi_numpy(1000_000) / 1000_000
print(pi)
3.142024
%timeit mcpi_numpy(1000_000)
18.6 ms ± 84.8 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
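
How much of this speedup comes from multithreading versus vectorization depends on how NumPy was built. One way to probe it (a sketch; the relevant environment variable depends on the BLAS library, e.g. OMP_NUM_THREADS or OPENBLAS_NUM_THREADS, and must be set before NumPy is imported) is to pin the backend to a single thread and repeat the timing:

import os
os.environ["OMP_NUM_THREADS"] = "1"        # honored by OpenMP-based BLAS builds
os.environ["OPENBLAS_NUM_THREADS"] = "1"   # honored by OpenBLAS builds

import time
import numpy as np

def mcpi_numpy(n_total):
    x = np.random.rand(n_total)
    y = np.random.rand(n_total)
    return np.sum(x*x + y*y < 1.0)

t0 = time.perf_counter()
mcpi_numpy(1_000_000)
print(f"{1e3 * (time.perf_counter() - t0):.1f} ms with a single backend thread")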

JAX Parallel Code#

Many operations in JAX, especially those related to linear algebra, automatically use multiple cores.

from jax import numpy  as jnp
from jax import random as jrd

def mcpi_jax(n_total):
    key = jrd.key(0)
    key1, key2 = jrd.split(key)
    
    x = jrd.uniform(key1, n_total)
    y = jrd.uniform(key2, n_total)
    return jnp.sum(x*x + y*y < 1.0)
pi = 4 * mcpi_jax(1000_000) / 1000_000
print(pi)
3.142476
%timeit mcpi_jax(1000_000)
7.37 ms ± 68 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
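
The function above runs eagerly. In practice JAX code is usually compiled with jax.jit, and since JAX dispatches work asynchronously, a fair timing should wait for the result with block_until_ready(). Here is a minimal jitted variant (an illustrative addition, not part of the lecture code):

from functools import partial

import jax
from jax import numpy as jnp
from jax import random as jrd

# n_total fixes the array shapes, so it must be a static (compile-time) argument.
@partial(jax.jit, static_argnums=0)
def mcpi_jax_jit(n_total, key):
    key1, key2 = jrd.split(key)
    x = jrd.uniform(key1, (n_total,))
    y = jrd.uniform(key2, (n_total,))
    return jnp.sum(x*x + y*y < 1.0)

n_inside = mcpi_jax_jit(1_000_000, jrd.key(0))
n_inside.block_until_ready()   # wait for the asynchronous computation to finish
print(4 * n_inside / 1_000_000)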

C Serial Code#

Our original Python code can be easily translated into C. Please put the following code into a new file mcpi_loop.c:

#include <stdlib.h>

int mcpi_loop(int n_total)
{
	int n_inside = 0;
	for(int i = 0; i < n_total; ++i) {
		double x = (double)rand() / RAND_MAX;
		double y = (double)rand() / RAND_MAX;

		if(x * x + y * y < 1.0)
			++n_inside;
	}

	return n_inside;
}

Although the actual C code is pretty simple, there is unfortunately no portable, high-resolution timing function available in bash. Hence, we need to do the timing ourselves in the C code:

#include <time.h>

static struct timespec t0;

void tik()
{
	clock_gettime(CLOCK_REALTIME, &t0);
}

double tok()
{
	struct timespec t1, dt;
	clock_gettime(CLOCK_REALTIME, &t1);

	dt.tv_nsec = t1.tv_nsec - t0.tv_nsec;
	dt.tv_sec  = t1.tv_sec  - t0.tv_sec;
	if(dt.tv_nsec < 0) { /* borrow one second if the nanosecond part went negative */
		dt.tv_nsec += 1000000000;
		dt.tv_sec--;
	}

	int ms = dt.tv_nsec / 1000000 + dt.tv_sec * 1000; /* whole milliseconds */
	int ns = dt.tv_nsec % 1000000;                    /* leftover nanoseconds */

	return ms + 1e-6 * ns; /* elapsed time in milliseconds */
}

Then we can put it all together and create the main function:

#include <stdio.h>

int main()
{
	tik();

	double pi = 4.0 * mcpi_loop(1000000) / 1000000;
	double ms = tok();

	printf("pi = %f\n", pi);
	printf("dt = %f ms\n", ms);

	return 0;
}

We can put all these functions into a single “mcpi_loop.c” file and then compile with:

gcc mcpi_loop.c -O3 -o mcpi_loop
./mcpi_loop

On my laptop, the run takes ~ 36 ms.

Shared Memory: OpenMP C Code#

OpenMP is a widely used API for writing multithreaded applications in C, C++, and Fortran. It allows you to add simple compiler directives (pragmas) to enable parallel execution on shared-memory systems.

Its key features include:

  • Easy to parallelize loops using #pragma omp parallel for

  • Threads share memory, so synchronization and data races must be handled carefully

  • OpenMP includes mechanisms for reductions, barriers, critical sections, and more

To use OpenMP:

  • Include the header #include <omp.h>

  • Compile with gcc -fopenmp

Here is our updated mcpi_omp() function:

#include <omp.h>
#include <stdlib.h>

int mcpi_omp(int n_total)
{
	int n_inside = 0;

	#pragma omp parallel for reduction(+:n_inside)
	for(int i = 0; i < n_total; ++i) {
		double x = (double)rand() / RAND_MAX;
		double y = (double)rand() / RAND_MAX;

		if(x * x + y * y < 1.0)
			++n_inside;
	}

	return n_inside;
}

Copying it together with tik(), tok(), and main() into “mcpi_omp.c” (and calling mcpi_omp() instead of mcpi_loop() in main()), we can now compile our OpenMP version by

gcc mcpi_omp.c -fopenmp -O3 -o mcpi_omp
./mcpi_omp

On my laptop, the run takes ~ 8 ms, matching the numpy implementation.

Distributed Memory: MPI#

MPI (Message Passing Interface) is the de facto standard for distributed-memory parallel programming. Unlike OpenMP, where threads share memory, MPI processes have separate memory spaces and must explicitly communicate using messages.

MPI is well-suited for large-scale clusters and supercomputers, where each node has its own memory and processors.

Common MPI functions include:

  • MPI_Init() and MPI_Finalize() for starting and ending MPI programs

  • MPI_Comm_rank() and MPI_Comm_size() to identify each process and the total number of processes

  • MPI_Send(), MPI_Recv(), and MPI_Reduce() for data exchange and aggregation

Important note: one of the most commonly used implementations of MPI is Open MPI. Despite the similar name, Open MPI is unrelated to OpenMP. The name similarity is unfortunately confusing!

  • OpenMP is for multithreading on shared-memory systems

  • Open MPI is a software implementation of the MPI standard for distributed-memory systems

To compile and run an MPI program, use mpicc to compile the program and use mpirun -np N ./program to run it with N processes.

Here is an example implementation:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* assumes mcpi_loop(), tik(), and tok() from above are in the same file */

int main(int argc, char** argv)
{
	tik();

	MPI_Init(&argc, &argv);

	int rank, size;
	MPI_Comm_rank(MPI_COMM_WORLD, &rank);
	MPI_Comm_size(MPI_COMM_WORLD, &size);

	int n_total = 1000000;
	int l_total = n_total / size;

	srand(time(NULL)+rank); // ensure different seed per process
	int l_inside = mcpi_loop(l_total);

	int n_inside = 0;
	MPI_Reduce(&l_inside, &n_inside, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

	if(rank == 0) {
		double pi = 4.0 * n_inside / n_total;
		printf("Estimated value of pi: %f\n", pi);
	}

	MPI_Finalize();

	double ms = tok();
	if(rank == 0)
		printf("dt = %f ms\n", ms);

	return 0;
}
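
Assuming everything is saved together in a file named mcpi_mpi.c (the file name is our choice here), it can be compiled and launched with:

mpicc mcpi_mpi.c -O3 -o mcpi_mpi
mpirun -np 4 ./mcpi_mpi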

On my laptop, running mpirun -np 4 ./mcpi_mpi takes ~ 25 ms. This is significantly higher than the OpenMP version because of all the startup overhead.

Job Submission to HPC with Slurm#

On many HPC clusters, job execution is managed by a workload manager. One of the most widely used systems is Slurm (Simple Linux Utility for Resource Management).

Slurm schedules jobs across compute nodes and handles resource allocation, job queues, and execution environments. Instead of running MPI jobs interactively with mpirun, users typically submit jobs using a Slurm script.

The UA HPC website has some concise documentation on using Slurm.

And here is a sample submission script “submit.sh” that we may use:

#!/bin/bash

#SBATCH --job-name=mcpi_mpi
#SBATCH --ntasks=4
#SBATCH --nodes=1
#SBATCH --mem=1gb
#SBATCH --time=00:05:00
#SBATCH --partition=standard
#SBATCH --account=2025q1-phys305
 
# Slurm inherits your environment, so "cd $SLURM_SUBMIT_DIR" is not needed
pwd; hostname; date
 
module load openmpi3
/usr/bin/time -o mcpi_mpi.time mpirun -np 4 ./mcpi_mpi

To submit the job, simply run

sbatch submit.sh

To check your job, run

squeue -u USERNAME