Leveraging GPUs

Pseudocode

The pseudocode that implements the diffusion loop is:

1. For each timestep from 1 to num_timesteps:
   2. Copy the current temperature values to a temporary array (temp_copy)
   3. Initialize arrays for neighbor sums and neighbor counts with zeros
   4. For each valid cell (ignoring boundaries):
      5. Calculate the sum of neighboring cells:
         - Add the value of the front neighbor if valid
         - Add the value of the back neighbor if valid
         - Add the value of the left neighbor if valid
         - Add the value of the right neighbor if valid
         - Add the value of the top neighbor if valid
         - Add the value of the bottom neighbor if valid
      6. Count the number of valid neighbors for each direction
   7. Update the cell's temperature:
      - New temperature = current temperature + diffusion coefficient * (neighbor_sum - 6 * current temperature) / neighbor_count
   8. Ensure invalid points (NaN) remain unchanged
   9. Update the main temperature array with the new values
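
As a rough illustration, one timestep of the pseudocode above could be written in vectorised NumPy as follows. This is a minimal sketch rather than the project's actual code: the function name `diffusion_step`, the default `diffusion_coeff=0.1`, and the choice to skip cells with no valid neighbours are all assumptions made for the example. The update formula is copied from step 7 as written.

```python
import numpy as np


def diffusion_step(temperature, diffusion_coeff=0.1):
    """Advance a 3D temperature field by one timestep (hypothetical helper)."""
    temp_copy = temperature.copy()                    # step 2: work on a copy
    valid = ~np.isnan(temp_copy)                      # NaN marks invalid points
    filled = np.where(valid, temp_copy, 0.0)          # invalid cells add 0 to sums

    neighbor_sum = np.zeros_like(filled)              # step 3: zero-initialised
    neighbor_count = np.zeros_like(filled)
    for axis in range(3):                             # steps 5 and 6
        for shift in (-1, 1):
            # np.roll wraps around at the edges, but only interior cells
            # are updated below, so the wrapped values are never used.
            neighbor_sum += np.roll(filled, shift, axis=axis)
            neighbor_count += np.roll(valid, shift, axis=axis)

    update = np.zeros_like(valid)                     # step 4: interior cells only
    update[1:-1, 1:-1, 1:-1] = valid[1:-1, 1:-1, 1:-1]
    update &= neighbor_count > 0                      # assumption: avoid 0-division

    new_temperature = temp_copy.copy()                # step 8: NaNs stay untouched
    new_temperature[update] = temp_copy[update] + diffusion_coeff * (
        neighbor_sum[update] - 6 * temp_copy[update]
    ) / neighbor_count[update]                        # step 7
    return new_temperature                            # step 9: caller reassigns


# Small usage example: one warm cell next to one invalid cell.
field = np.zeros((5, 5, 5))
field[2, 2, 2] = 6.0
field[2, 2, 3] = np.nan
stepped = diffusion_step(field, diffusion_coeff=0.1)
```

Calling `diffusion_step` in a loop over `num_timesteps` and reassigning the result each time would correspond to steps 1 and 9.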

Running with NumPy

poetry run diffusion_numpy --num_timesteps 100

The above command runs the 3D diffusion model using the NumPy version of the code for 100 timesteps. When execution finishes, it reports the time taken. On an AMD EPYC 7552 48-Core Processor, the output is:

NumPy model completed in 489.2647 seconds. Average time per timestep: 4.8926 seconds.
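
A report in this format could be produced with a simple wall-clock timer around the timestep loop. The sketch below is an assumption about how such a report might be built, not the script's actual implementation; `run_timed` and `format_report` are hypothetical names. It also shows that the per-timestep figure is the total divided by the number of timesteps.

```python
import time


def run_timed(step, num_timesteps):
    """Call step() num_timesteps times and return the total elapsed seconds."""
    start = time.perf_counter()
    for _ in range(num_timesteps):
        step()
    return time.perf_counter() - start


def format_report(label, total_seconds, num_timesteps):
    """Build a report line in the same style as the output shown above."""
    average = total_seconds / num_timesteps
    return (
        f"{label} model completed in {total_seconds:.4f} seconds. "
        f"Average time per timestep: {average:.4f} seconds."
    )


# Reproduce the reported line from the total runtime alone.
report = format_report("NumPy", 489.2647, 100)
print(report)
```

When timing GPU code this way, the device usually needs to be synchronised before the clock is read, because GPU kernels are launched asynchronously.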

You can visualise the model outputs produced with:

poetry run visualise_slice --target_depth 0 --animation_speed 100 --data_file predicted_temperatures_numpy.nc

Note that the file predicted_temperatures_numpy.nc is generated by the diffusion_numpy script run above. The visualise_slice command then generates a new interactive HTML file, output/predicted_temperature_2d_interactive.html.

Running with CuPy

Because the same code has also been written in CuPy, you can compare CPU and GPU performance with the following:

poetry run diffusion_cupy --num_timesteps 100

The above command runs the 3D diffusion model using the CuPy version of the code for 100 timesteps. When execution finishes, it reports the time taken. On an NVIDIA A40 GPU, the output is:

CuPy model completed in 171.9884 seconds. Average time per timestep: 1.7199 seconds.

You can visualise the model outputs produced with:

poetry run visualise_slice --target_depth 0 --animation_speed 100 --data_file predicted_temperatures_cupy.nc

Note that the file predicted_temperatures_cupy.nc is generated by the diffusion_cupy script run above. The visualise_slice command then generates a new interactive HTML file, output/predicted_temperature_2d_interactive.html.
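
To illustrate why the same code can be written for both libraries, the sketch below passes the array module in as a parameter, so one function body serves NumPy and CuPy alike. The function `neighbor_sum` and the parameter name `xp` are hypothetical and are not taken from the project's scripts. The CuPy calls appear only in comments, since they need a CUDA-capable GPU to run.

```python
import numpy as np


def neighbor_sum(xp, field):
    """Sum the six face neighbours of every interior cell.

    `xp` is the array module to use: numpy on the CPU, cupy on the GPU.
    Boundary cells are left at zero.
    """
    total = xp.zeros_like(field)
    inner = total[1:-1, 1:-1, 1:-1]          # a view, so += writes into `total`
    inner += field[:-2, 1:-1, 1:-1] + field[2:, 1:-1, 1:-1]   # first axis
    inner += field[1:-1, :-2, 1:-1] + field[1:-1, 2:, 1:-1]   # second axis
    inner += field[1:-1, 1:-1, :-2] + field[1:-1, 1:-1, 2:]   # third axis
    return total


cpu_field = np.arange(27, dtype=np.float64).reshape(3, 3, 3)
cpu_result = neighbor_sum(np, cpu_field)

# With CuPy installed and a CUDA GPU available, the same function could be
# called on device arrays (not executed here):
#   import cupy as cp
#   gpu_result = neighbor_sum(cp, cp.asarray(cpu_field))
#   back_on_cpu = cp.asnumpy(gpu_result)
```

Each `cp.asarray` or `cp.asnumpy` call copies data between host and device, so keeping arrays on the GPU across timesteps is generally preferable to converting inside the loop.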

Performance Comparison: CPU vs GPU

Overall Speedup

CPU runtime: 489.2647 seconds
GPU runtime: 171.9884 seconds
Speedup factor:
\[ \text{Speedup} = \frac{\text{CPU time}}{\text{GPU time}} = \frac{489.2647}{171.9884} \approx 2.84 \]
The GPU completed the task approximately 2.84 times faster than the CPU.

Per-Timestep Speedup

CPU average timestep: 4.8926 seconds
GPU average timestep: 1.7199 seconds
Speedup factor per timestep:
\[ \text{Speedup per timestep} = \frac{\text{CPU timestep}}{\text{GPU timestep}} = \frac{4.8926}{1.7199} \approx 2.84 \]
On a per-timestep basis, the GPU is about 2.84 times faster.
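
The arithmetic can be checked with a few lines of Python. This is only a sanity check using the unrounded timings printed by the two runs; the variable names are chosen for the example.

```python
# Timings as reported by the NumPy (CPU) and CuPy (GPU) runs, in seconds.
cpu_total, gpu_total = 489.2647, 171.9884
cpu_per_step, gpu_per_step = 4.8926, 1.7199

overall_speedup = cpu_total / gpu_total
per_step_speedup = cpu_per_step / gpu_per_step

print(f"Overall speedup:      {overall_speedup:.2f}x")
print(f"Per-timestep speedup: {per_step_speedup:.2f}x")
```

With the unrounded inputs, both ratios round to the same value.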

Efficiency Observation

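Because the average time per timestep is the total runtime divided by the 100 timesteps, the overall and per-timestep speedup factors agree up to rounding. The figures show a steady GPU advantage over the whole run, but these timings alone cannot separate compute time from data-transfer or kernel-launch overhead.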

Implications

Computational Efficiency:
- Using a GPU provides substantial performance gains, especially for tasks with repetitive, parallelizable computations such as numerical modeling or simulations.

Observed Speedup (~2.84x improvement) suggests:
- The task is well-suited for GPU acceleration.
- The full potential of the GPU might not yet be realized due to:
   - Limited parallelism in the workload.
   - Overheads from memory transfers between CPU and GPU.
   - Suboptimal use of GPU-specific optimizations.

The GPU significantly outpaces the CPU for this task, reducing runtime by approximately 65%. Note that this approach is a direct port from NumPy to CuPy, which requires minimal effort. Further optimization of the GPU code, for example by reducing known time-intensive operations such as data transfer between CPU and GPU, could improve performance and exploit more of the GPU's potential.
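
As a quick check of the 65% figure, using the unrounded timings reported by the two runs, the relative reduction in runtime works out as:

\[ 1 - \frac{\text{GPU time}}{\text{CPU time}} = 1 - \frac{171.9884}{489.2647} \approx 0.648 \]

This is the same measurement as the speedup factor expressed differently, since the reduction equals 1 - 1/2.84.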