Leaveaging GPUs
Pseudocode
The psuedocode that implements the diffusion loop is:
1. For each timestep from 1 to num_timesteps:
2. Copy the current temperature values to a temporary array (temp_copy)
3. Initialize arrays for neighbor sums and neighbor counts with zeros
4. For each valid cell (ignoring boundaries):
5. Calculate the sum of neighboring cells:
- Add the value of the front neighbor if valid
- Add the value of the back neighbor if valid
- Add the value of the left neighbor if valid
- Add the value of the right neighbor if valid
- Add the value of the top neighbor if valid
- Add the value of the bottom neighbor if valid
6. Count the number of valid neighbors for each direction
7. Update the cell's temperature:
- New temperature = current temperature + diffusion coefficient * (neighbor_sum - 6 * current temperature) / neighbor_count
8. Ensure invalid points (NaN) remain unchanged
9. Update the main temperature array with the new values
Running with NumPy
poetry run diffusion_numpy --num_timesteps 100
The above command will run the 3D diffusion model using the NumPy version of the code for 100 timesteps. Once the execution has finished then a report will be provided concerning the time taken for execution. When running on an AMD EPYC 7552 48-Core Processor, the execution outputs:
NumPy model completed in 489.2647 seconds. Average time per timestep: 4.8926 seconds.
You can visualise the model outputs producded with
poetry run visualise_slice --target_depth 0 --animation_speed 100 --data_file predicted_temperatures_numpy.nc
Of note is that the file predicted_temperatures_numpy.nc
is generated during the execution of the above command for the script diffusion_numpy
. This will then generate a new interactive HTML file output/predicted_temperature_2d_interactive.html
.
Running With CuPy
As the same code has been wrote in CuPy you can experiment with the difference between CPU and GPU code with the following:
poetry run diffusion_cupy --num_timesteps 100
The above command will run the 3D diffusion model using the CuPy version of the code for 100 timesteps. Once the execution has finished then a report will be provided concerning the time taken for execution. When running on an NVIDIA A40 GPU, the execution outputs:
CuPy model completed in 171.9884 seconds. Average time per timestep: 1.7199 seconds.
You can visualise the model outputs producded with
poetry run visualise_slice --target_depth 0 --animation_speed 100 --data_file predicted_temperatures_cupy.nc
Of note is that the file predicted_temperatures_numpy.nc
is generated during the execution of the above command for the script diffusion_numpy
. This will then generate a new interactive HTML file output/predicted_temperature_2d_interactive.html
.
Performance Comparison: CPU vs GPU
Overall Speedup
- CPU runtime: 489 seconds
- GPU runtime: 171.9884 seconds
- Speedup factor:
[ \text{Speedup} = \frac{\text{CPU time}}{\text{GPU time}} = \frac{489}{171.9884} \approx 2.84 ]
The GPU completed the task approximately 2.84 times faster than the CPU.
Per-Timestep Speedup
- CPU average timestep: 4.9 seconds
- GPU average timestep: 1.7199 seconds
- Speedup factor per timestep:
[ \text{Speedup per timestep} = \frac{\text{CPU timestep}}{\text{GPU timestep}} = \frac{4.9}{1.7199} \approx 2.85 ]
On a per-timestep basis, the GPU is about 2.85 times faster.
Efficiency Observation
- The consistent speedup factor (both overall and per timestep) suggests that the GPU effectively parallelizes computations without significant overhead from data transfer or kernel launches.
Implications
- Computational Efficiency:
Using a GPU provides substantial performance gains, especially for tasks with repetitive, parallelizable computations such as numerical modeling or simulations. - Observed Speedup (~2.84x improvement) suggests:
- The task is well-suited for GPU acceleration.
- Full potential of the GPU might not yet be realized due to:
- Limited parallelism in the workload.
- Overheads from memory transfers between CPU and GPU.
- Suboptimal use of GPU-specific optimizations.
The GPU's performance significantly outpaces the CPU for this task, reducing runtime by approximately 65%. Of note is that this approach is simply a direct move from NumPy to CuPy which represents a minimal amount of effort. Further optimization of the GPU code could enhance performance and exploit its full potential, leveraging on known time intensive tasks for GPUs such as data transfer.