Genofabio/gpu-flocking-optimization

GPU Boid Simulation – Prey, Predators & Leaders

Introduction

This project focuses on the optimization of a flocking simulation with prey–predator dynamics. Flocking models are widely used to reproduce the collective movement of animal groups such as schools of fish, flocks of birds, or herds. Each agent (boid) follows a few simple rules—alignment, cohesion, and separation—that together generate complex emergent group behavior.

Our implementation combines OpenGL for visualization and CUDA for GPU acceleration. We first developed a CPU-based version, then redesigned and optimized it to exploit GPU parallelism. In our benchmarks the GPU version sustains about 30 FPS with 4,500 prey, ten times the flock size the CPU version handles at the same frame rate.

The project was inspired by a University of Pennsylvania assignment outline, which we used as a starting point and challenge to build our own full implementation.


[Screenshots: CPU Baseline | GPU Optimized (CUDA)]

Features

  • Flocking Forces: Cohesion (moving toward the center of nearby companions), Separation (maintaining safe distances), and Alignment (adjusting velocity to match neighbors).
  • Three types of agents:
    • Prey → follow flocking rules, evade predators, follow leaders
    • Leaders → guide prey, avoid predators and other leaders
    • Predators → chase prey, keep distance from other predators
  • Obstacles (walls) → static obstacles that boids anticipate using a look-ahead vector along their velocity. A repulsive force, scaled quadratically by proximity, allows boids to turn smoothly before collision.
  • CPU & GPU implementations → fair comparison of performance
  • Profiler → measures execution times, FPS, exports results to CSV
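
The flocking forces and the look-ahead wall repulsion listed above can be sketched on the CPU as follows. This is a minimal illustration, not the project's actual code: the names `FlockParams`, `flockingForce`, and `wallRepulsion`, the inverse-square separation term, the single vertical wall at `x = wallX`, and all weights are assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec2 { float x, y; };
inline Vec2 operator+(Vec2 a, Vec2 b) { return {a.x + b.x, a.y + b.y}; }
inline Vec2 operator-(Vec2 a, Vec2 b) { return {a.x - b.x, a.y - b.y}; }
inline Vec2 operator*(Vec2 a, float s) { return {a.x * s, a.y * s}; }
inline float length(Vec2 a) { return std::sqrt(a.x * a.x + a.y * a.y); }

struct Boid { Vec2 pos, vel; };

// Hypothetical tuning values; the real simulation's coefficients are not shown here.
struct FlockParams {
    float radius = 5.0f;      // neighbors farther than this are ignored
    float cohesion = 1.0f;
    float separation = 1.0f;
    float alignment = 1.0f;
};

// Brute-force O(N) scan per boid: sums cohesion, separation, and alignment.
Vec2 flockingForce(const std::vector<Boid>& boids, std::size_t self, const FlockParams& p) {
    assert(self < boids.size());
    const Boid& me = boids[self];
    Vec2 center{0.0f, 0.0f}, avgVel{0.0f, 0.0f}, push{0.0f, 0.0f};
    int count = 0;
    for (std::size_t j = 0; j < boids.size(); ++j) {
        if (j == self) continue;
        const Vec2 offset = boids[j].pos - me.pos;
        const float d = length(offset);
        if (d >= p.radius) continue;
        center = center + boids[j].pos;
        avgVel = avgVel + boids[j].vel;
        // Separation: push away from the neighbor, stronger when closer.
        if (d > 0.0f) push = push - offset * (1.0f / (d * d));
        ++count;
    }
    if (count == 0) return {0.0f, 0.0f};
    const float inv = 1.0f / static_cast<float>(count);
    const Vec2 cohesion = center * inv - me.pos;   // toward the local center
    const Vec2 alignment = avgVel * inv - me.vel;  // toward the average velocity
    return cohesion * p.cohesion + push * p.separation + alignment * p.alignment;
}

// Repulsion from a vertical wall at x = wallX, probed at a look-ahead point
// along the velocity. The force grows quadratically as the probe nears the wall.
Vec2 wallRepulsion(const Boid& b, float wallX, float lookAhead, float range, float strength) {
    assert(range > 0.0f);
    const float speed = length(b.vel);
    const Vec2 ahead = speed > 0.0f ? b.pos + b.vel * (lookAhead / speed) : b.pos;
    const float side = (b.pos.x < wallX) ? -1.0f : 1.0f;  // wall normal, pointing at the boid
    const float dist = (ahead.x - wallX) * side;          // <= 0 once the probe crosses the wall
    const float proximity = dist <= 0.0f ? 1.0f : std::max(0.0f, 1.0f - dist / range);
    return {side * strength * proximity * proximity, 0.0f};
}
```

Because the probe sits ahead of the boid, the repulsion starts acting before the boid itself reaches the wall, which is what produces the smooth turn.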

GPU Optimizations

  • Structure of Arrays (SoA): The original boid structure was split into one array per field so that threads in the same warp read contiguous addresses, which allows their global memory accesses to be coalesced.
  • GPU Constant Memory: Used to store simulation constants (grid size, world dimensions, interaction coefficients) to reduce redundant global memory reads during the update phase.
  • Uniform spatial grid with boid reordering → reduces neighbor checks from O(N²) to O(N·k), where k is the number of boids in the grid cells surrounding each boid.
  • Shared memory caching: Temporarily stores neighboring boids within the same grid cell to reduce expensive global memory accesses.
  • CUDA streams: Different interaction rules (boid-boid, wall repulsion, leader following, predator-prey) are dispatched to separate CUDA streams to overlap execution and improve throughput. Rendering buffers are also split into chunks and transferred using cudaMemcpyAsync on multiple streams.
  • Enhanced profiler: CUDA events provide fine-grained, per-phase GPU timings, reported in milliseconds.
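
To make the SoA layout and the grid-based reordering more concrete, here is a host-side C++ sketch of the same idea. It is an illustration under assumptions, not the project's CUDA code: `BoidsSoA`, `UniformGrid`, `buildGrid`, and `countNeighbors` are hypothetical names, the sort is a plain counting sort, and the interaction radius is assumed to be no larger than the cell size so that a 3×3 block of cells is enough.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Structure of Arrays: one contiguous array per field instead of an array of structs.
struct BoidsSoA {
    std::vector<float> posX, posY, velX, velY;
    std::size_t size() const { return posX.size(); }
};

struct UniformGrid {
    float cellSize = 1.0f;
    int cols = 1, rows = 1;
    // After sorting, the boids of cell c occupy indices [cellStart[c], cellStart[c + 1]).
    std::vector<int> cellStart;

    int cellX(float x) const { return std::min(std::max(static_cast<int>(x / cellSize), 0), cols - 1); }
    int cellY(float y) const { return std::min(std::max(static_cast<int>(y / cellSize), 0), rows - 1); }
    int cellOf(float x, float y) const { return cellY(y) * cols + cellX(x); }
};

// Builds the grid and reorders the boid arrays so each cell's boids are contiguous.
UniformGrid buildGrid(BoidsSoA& b, float worldW, float worldH, float cellSize) {
    assert(cellSize > 0.0f);
    UniformGrid g;
    g.cellSize = cellSize;
    g.cols = std::max(1, static_cast<int>(std::ceil(worldW / cellSize)));
    g.rows = std::max(1, static_cast<int>(std::ceil(worldH / cellSize)));
    const std::size_t n = b.size();
    const std::size_t numCells = static_cast<std::size_t>(g.cols) * static_cast<std::size_t>(g.rows);

    // Pass 1: count boids per cell.
    std::vector<int> cell(n);
    g.cellStart.assign(numCells + 1, 0);
    for (std::size_t i = 0; i < n; ++i) {
        cell[i] = g.cellOf(b.posX[i], b.posY[i]);
        ++g.cellStart[static_cast<std::size_t>(cell[i]) + 1];
    }
    // Pass 2: prefix sum turns counts into start offsets.
    for (std::size_t c = 0; c < numCells; ++c) g.cellStart[c + 1] += g.cellStart[c];

    // Pass 3: scatter every field into cell order.
    std::vector<int> cursor(g.cellStart.begin(), g.cellStart.end() - 1);
    BoidsSoA sorted;
    sorted.posX.resize(n); sorted.posY.resize(n);
    sorted.velX.resize(n); sorted.velY.resize(n);
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t dst = static_cast<std::size_t>(cursor[static_cast<std::size_t>(cell[i])]++);
        sorted.posX[dst] = b.posX[i]; sorted.posY[dst] = b.posY[i];
        sorted.velX[dst] = b.velX[i]; sorted.velY[dst] = b.velY[i];
    }
    b = std::move(sorted);
    return g;
}

// Visits only the 3x3 block of cells around boid i instead of all N boids.
int countNeighbors(const BoidsSoA& b, const UniformGrid& g, std::size_t i, float radius) {
    assert(i < b.size() && radius <= g.cellSize);
    const int cx = g.cellX(b.posX[i]), cy = g.cellY(b.posY[i]);
    int count = 0;
    for (int y = std::max(cy - 1, 0); y <= std::min(cy + 1, g.rows - 1); ++y) {
        for (int x = std::max(cx - 1, 0); x <= std::min(cx + 1, g.cols - 1); ++x) {
            const std::size_t c = static_cast<std::size_t>(y * g.cols + x);
            for (int j = g.cellStart[c]; j < g.cellStart[c + 1]; ++j) {
                if (static_cast<std::size_t>(j) == i) continue;
                const float dx = b.posX[static_cast<std::size_t>(j)] - b.posX[i];
                const float dy = b.posY[static_cast<std::size_t>(j)] - b.posY[i];
                if (dx * dx + dy * dy < radius * radius) ++count;
            }
        }
    }
    return count;
}
```

On the GPU the same structure pays off twice: consecutive threads read consecutive elements of each field array, and boids that interact sit next to each other in memory after the reordering.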

Benchmarks (Performance)

To measure execution times and pinpoint bottlenecks, we developed a unified Profiler. It uses high-resolution timers (std::chrono) on the CPU and CUDA events on the GPU, which report elapsed time in milliseconds with sub-millisecond resolution. Results can be logged, averaged over multiple runs, and exported to CSV for offline analysis.
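
The CPU side of such a profiler could look roughly like the sketch below. The class layout, the method names (`measure`, `addSample`, `averageMs`, `writeCsv`, `toCsv`), and the CSV column names are assumptions for illustration; the project's own Profiler interface may differ, and GPU timings from CUDA events would be fed in through something like `addSample`.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <ostream>
#include <sstream>
#include <string>

class Profiler {
public:
    // Records one timing sample (in milliseconds) for a named section.
    void addSample(const std::string& section, double ms) {
        assert(ms >= 0.0);
        Stats& s = stats_[section];
        s.totalMs += ms;
        ++s.count;
    }

    // Times a callable on the CPU with a monotonic clock and records the result.
    template <typename Fn>
    void measure(const std::string& section, Fn&& fn) {
        const auto start = std::chrono::steady_clock::now();
        fn();
        const auto end = std::chrono::steady_clock::now();
        addSample(section, std::chrono::duration<double, std::milli>(end - start).count());
    }

    // Average over all samples of a section; 0 if the section was never recorded.
    double averageMs(const std::string& section) const {
        const auto it = stats_.find(section);
        if (it == stats_.end() || it->second.count == 0) return 0.0;
        return it->second.totalMs / static_cast<double>(it->second.count);
    }

    // Writes one CSV row per section, sorted by section name.
    void writeCsv(std::ostream& out) const {
        out << "section,samples,average_ms\n";
        for (const auto& entry : stats_) {
            out << entry.first << ',' << entry.second.count << ','
                << averageMs(entry.first) << '\n';
        }
    }

    std::string toCsv() const {
        std::ostringstream out;
        writeCsv(out);
        return out.str();
    }

private:
    struct Stats {
        double totalMs = 0.0;
        long count = 0;
    };
    std::map<std::string, Stats> stats_;
};
```

A frame loop would wrap each phase, for example `profiler.measure("update", [&] { /* step the simulation */ });`, and dump `toCsv()` to a file at the end of the run.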

Our profiling shows that rendering is cheap and scales well (consistently under 3 ms on CPU and under 1 ms on GPU). The primary bottleneck is the physics computation, specifically summing forces, sorting, and memory operations for inter-boid interactions.

CPU vs. GPU Comparison

  • CPU limitations: The CPU implementation reaches only about 30 FPS with 450 prey. Adding just 50 more agents drops performance to roughly 16-17 FPS.
  • GPU scalability: The GPU version offloads the physics computation and handles 10x as many agents (+900% capacity) at a comparable frame rate, simulating 4,500 prey at 31-32 FPS.
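
| Platform | Number of Prey | Compute Forces (ms) | Total Update (ms) | Render (ms) | Actual FPS |
|----------|----------------|---------------------|-------------------|-------------|------------|
| CPU      | 450            | ~32-33              | 32.90             | 2.38        | 30         |
| CPU      | 500            | ~43-44              | 43.28             | 2.91        | 16-17      |
| GPU      | 4,500          | 20.28               | 31.83             | 0.76        | 31-32      |
| GPU      | 8,000          | 42.15               | 60.08             | 0.83        | 15-17      |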
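
The arithmetic behind these figures can be cross-checked with two small helpers. Both function names are hypothetical, and the FPS estimate simply assumes that update and render run back to back with no other per-frame cost, so it will not match the measured FPS column exactly.

```cpp
#include <cassert>

// Rough frames-per-second estimate from the two timed phases of a frame.
// Assumes update and render run sequentially and nothing else takes time.
double estimatedFps(double updateMs, double renderMs) {
    assert(updateMs + renderMs > 0.0);
    return 1000.0 / (updateMs + renderMs);
}

// Relative capacity gain in percent: 10x as many agents is a +900% gain.
double capacityGainPercent(double baselineAgents, double improvedAgents) {
    assert(baselineAgents > 0.0);
    return (improvedAgents / baselineAgents - 1.0) * 100.0;
}
```

For the GPU row with 4,500 prey, `estimatedFps(31.83, 0.76)` comes out near 30.7, close to the measured 31-32 FPS, and `capacityGainPercent(450, 4500)` gives the +900% quoted above.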

About

A high-performance C++/CUDA flocking simulation with prey-predator-leader dynamics. Optimized with spatial grids, SoA, and shared memory for real-time 8000+ boids.
