Standard Simulation Tool of HEP
Geant4 simulates particles travelling through matter
Geant4 Approach
Very General and Capable Tool
Huge CPU Memory+Time Expense
Not a Photo, a Calculation
Much in common : geometry, light sources, optical physics
Many Applications of ray tracing :
10 Giga Rays/s
Offload Ray Trace to Dedicated HW
SM : Streaming Multiprocessor
BVH : Bounding Volume Hierarchy
RTX Platform : Hybrid Rendering
-> real-time photoreal cinematic 3D rendering
Tree of Bounding Boxes (bbox)
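A BVH is traversed by testing the ray against bounding boxes and descending only into boxes that are hit. The per-box test can be sketched with the standard "slab" method; this is a minimal host-side C++ sketch, not the RTX hardware or OptiX implementation, and the names `Vec3`, `BBox` and `intersect_bbox` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <utility>

struct Vec3 { float x, y, z; };
struct BBox { Vec3 min, max; };   // axis-aligned bounding box

// Slab test: clip the ray parameter interval [tmin, tmax] against the three
// pairs of axis-aligned planes. The ray hits the box if the interval stays non-empty.
inline bool intersect_bbox(const BBox& box, const Vec3& origin, const Vec3& dir,
                           float tmin = 0.f,
                           float tmax = std::numeric_limits<float>::max())
{
    assert(tmin <= tmax);
    const float o[3]  = { origin.x, origin.y, origin.z };
    const float d[3]  = { dir.x, dir.y, dir.z };
    const float lo[3] = { box.min.x, box.min.y, box.min.z };
    const float hi[3] = { box.max.x, box.max.y, box.max.z };

    for (int i = 0; i < 3; ++i)
    {
        if (d[i] == 0.f)
        {
            // Ray parallel to this slab: it must start inside the slab to hit.
            if (o[i] < lo[i] || o[i] > hi[i]) return false;
            continue;
        }
        float inv = 1.f / d[i];
        float t0 = (lo[i] - o[i]) * inv;
        float t1 = (hi[i] - o[i]) * inv;
        if (t0 > t1) std::swap(t0, t1);
        tmin = std::max(tmin, t0);
        tmax = std::min(tmax, t1);
        if (tmin > tmax) return false;   // interval became empty: miss
    }
    return true;
}
```

Because a miss on a parent box rules out every primitive beneath it, most of the geometry is never tested for a given ray.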
OptiX Raytracing Pipeline
Analogous to OpenGL rasterization pipeline:
OptiX makes GPU ray tracing accessible
NVIDIA expertise:
Opticks provides (Yellow):
[1] Turing RTX GPUs
GPU Resident Photons
Thrust: high level C++ access to CUDA
OptiX : single-ray programming model -> line-by-line translation
Outside/Inside Unions
dot(normal,rayDir) -> Enter/Exit
Complete Binary Tree, pick between pairs of nearest intersects:
UNION tA < tB | Enter B | Exit B | Miss B |
---|---|---|---|
Enter A | ReturnA | LoopA | ReturnA |
Exit A | ReturnA | ReturnB | ReturnA |
Miss A | ReturnB | ReturnB | ReturnMiss |
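The table above can be read as a small lookup keyed on the Enter/Exit/Miss state of each child. The sketch below encodes only the tA < tB table shown here. It assumes outward-pointing normals (so a negative dot product means Enter), and the names `State`, `Action`, `classify` and `union_action_a_closer` are hypothetical, not the Opticks identifiers.

```cpp
#include <cassert>

enum class State  : int { Enter = 0, Exit = 1, Miss = 2 };
enum class Action : int { ReturnA, ReturnB, ReturnMiss, LoopA };

// Classify an intersect from the sign of dot(normal, rayDir).
// Assumption: normals point outward, so a ray entering the solid has dot < 0.
inline State classify(bool hit, float normalDotDir)
{
    if (!hit) return State::Miss;
    return normalDotDir < 0.f ? State::Enter : State::Exit;
}

// UNION action for the case tA < tB (A is the nearer intersect),
// transcribed row by row from the table: rows are the state of A, columns the state of B.
inline Action union_action_a_closer(State a, State b)
{
    static const Action table[3][3] = {
        /* Enter A */ { Action::ReturnA, Action::LoopA,   Action::ReturnA    },
        /* Exit  A */ { Action::ReturnA, Action::ReturnB, Action::ReturnA    },
        /* Miss  A */ { Action::ReturnB, Action::ReturnB, Action::ReturnMiss },
    };
    int ia = static_cast<int>(a);
    int ib = static_cast<int>(b);
    assert(ia >= 0 && ia < 3 && ib >= 0 && ib < 3);
    return table[ia][ib];
}
```

A table lookup keeps the per-node decision branch-light, which suits evaluating the binary tree on a GPU.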
Materials/Surfaces -> GPU Texture
Material/Surface/Scintillator properties
Material/surface boundary : 4 indices
Primitives labelled with unique boundary index
simple/fast properties + reemission wavelength
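Storing properties in a GPU texture turns a wavelength-dependent lookup into a hardware-interpolated fetch. The host-side sketch below imitates that idea with linear interpolation over a uniformly sampled wavelength domain. `PropertyTable` and its members are hypothetical names, and the actual Opticks texture layout is not shown in this text.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One property (e.g. refractive index) sampled uniformly between wl_min and wl_max.
struct PropertyTable
{
    float wl_min;                // nm, first sample
    float wl_max;                // nm, last sample
    std::vector<float> values;   // at least two samples

    // Linear interpolation with clamping at the domain edges,
    // similar in spirit to a clamped, linearly filtered texture fetch.
    float lookup(float wl) const
    {
        assert(values.size() >= 2 && wl_max > wl_min);
        float u = (wl - wl_min) / (wl_max - wl_min);   // normalized coordinate
        if (u < 0.f) u = 0.f;
        if (u > 1.f) u = 1.f;
        float x = u * static_cast<float>(values.size() - 1);
        std::size_t i = static_cast<std::size_t>(x);
        if (i >= values.size() - 1) i = values.size() - 2;
        float f = x - static_cast<float>(i);           // fraction between samples i and i+1
        return values[i] * (1.f - f) + values[i + 1] * f;
    }
};
```

For example, a table sampled at 400, 550 and 700 nm returns the midpoint of the first two samples when queried at 475 nm.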
G4 Structure Tree -> Instance+Global Arrays -> OptiX
Group structure into repeated instances + global remainder:
instancing -> huge memory savings for JUNO PMTs
Random Aligned Bi-Simulation
Same inputs to Opticks and Geant4:
Common recording into OpticksEvents:
Aligned random consumption, direct comparison:
Bi-simulations of all JUNO solids, with millions of photons
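Aligned random consumption means both simulations draw the same random numbers in the same order, so their step histories can be compared directly rather than statistically. The toy sketch below illustrates the idea with a pre-generated sequence and a cursor. `AlignedRng`, `precook` and `toy_history` are hypothetical names, and the absorb/scatter model is a stand-in for illustration, not Opticks or Geant4 physics.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Pre-generate the random sequence once so every consumer sees identical numbers.
inline std::vector<double> precook(std::size_t n, unsigned seed)
{
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    std::vector<double> seq(n);
    for (double& x : seq) x = uniform(gen);
    return seq;
}

// Cursor over a shared sequence: each simulation gets its own cursor.
struct AlignedRng
{
    const std::vector<double>& seq;
    std::size_t cursor = 0;
    explicit AlignedRng(const std::vector<double>& s) : seq(s) {}
    double next() { assert(cursor < seq.size()); return seq[cursor++]; }
};

// Toy propagation: at each step the photon is absorbed ('A') with
// probability p_absorb, otherwise it scatters ('S') and continues.
template <typename Real>
std::vector<char> toy_history(AlignedRng& rng, Real p_absorb, int max_steps)
{
    std::vector<char> history;
    for (int i = 0; i < max_steps; ++i)
    {
        Real u = static_cast<Real>(rng.next());
        if (u < p_absorb) { history.push_back('A'); break; }
        history.push_back('S');
    }
    return history;
}
```

Running `toy_history<float>` and `toy_history<double>` over the same sequence normally gives identical histories. The rare draws that land within rounding distance of a threshold are where float and double can diverge, and aligned running makes those cases easy to find.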
Primary sources of problems
Primary cause : float vs double
Geant4 uses double precision everywhere; Opticks uses it only sparingly (double precision was observed to cause a 10x slowdown with RTX)
Conclude
Compression Essential
Domain compression to fit in VRAM
4-bit History Flags at Each Step
BT : boundary transmit
BR : boundary reflect
SC : bulk scatter
AB : bulk absorb
SD : surface detect
SA : surface absorb
Up to 16 steps of the photon propagation are recorded.
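Sixteen steps at 4 bits each fit exactly into one 64-bit integer, so a whole photon history can be held and compared as a single number. The sketch below shows one way to pack and unpack such a sequence. The numeric flag codes and the names `Flag`, `append_flag` and `flag_at` are hypothetical; only the abbreviations come from the slide.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 4-bit codes: zero is reserved to mean "no step recorded".
enum Flag : std::uint8_t { BT = 1, BR = 2, SC = 3, AB = 4, SD = 5, SA = 6 };

// Store the flag for a given step (0..15) in its own 4-bit slot.
inline std::uint64_t append_flag(std::uint64_t seq, int step, std::uint8_t flag)
{
    assert(step >= 0 && step < 16 && flag < 16);
    return seq | (static_cast<std::uint64_t>(flag) << (4 * step));
}

// Read back the 4-bit flag recorded at a given step.
inline std::uint8_t flag_at(std::uint64_t seq, int step)
{
    assert(step >= 0 && step < 16);
    return static_cast<std::uint8_t>((seq >> (4 * step)) & 0xF);
}
```

With these codes a photon that transmits at a boundary, scatters, then is detected packs to the hexadecimal value `0x531`, reading the steps from the least significant nibble upward.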
Photon Array : 4 * float4 = 512 bits/photon
Step Record Array : 2 * short4 = 2*16*4 = 128 bits/record
Compression uses known domains of position (geometry center, extent), time (0:200ns), wavelength, polarization.
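Domain compression maps a float with a known range onto a 16-bit integer, which is how a step record can shrink to 2 * short4. The sketch below is one plausible scheme under stated assumptions: a symmetric mapping onto [-32767, 32767] with clamping. The function names are hypothetical and the exact Opticks encoding may differ. For positions, the domain would be `center - extent` to `center + extent`.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Map v in the domain [lo, hi] onto a signed 16-bit integer, clamping out-of-range values.
inline std::int16_t compress(float v, float lo, float hi)
{
    assert(hi > lo);
    float u = (v - lo) / (hi - lo);              // 0..1 within the domain
    u = std::fmin(std::fmax(u, 0.f), 1.f);       // clamp
    return static_cast<std::int16_t>(std::lround((2.f * u - 1.f) * 32767.f));
}

// Inverse mapping: recover an approximate float from the 16-bit value.
inline float decompress(std::int16_t q, float lo, float hi)
{
    float u = (static_cast<float>(q) / 32767.f + 1.f) * 0.5f;
    return lo + u * (hi - lo);
}
```

For the 0:200 ns time domain this gives a step of about 200/65534 ns, roughly 3 ps, so the round-trip error stays far below typical detector timing resolution.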
Test Hardware + Software
Workstation
Software
IHEP GPU Cluster
Full JUNO Analytic Geometry j1808v5
Production Mode : does the minimum
Multi-Event Running, Measure:
Photon Launch Size : VRAM Limited
NVIDIA Quadro RTX 8000 (48 GB)
400M photons x 112 bytes ~ 45G
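The launch-size limit is simple arithmetic: photons per launch times bytes per photon must fit in VRAM. The sketch below reproduces the estimate above; the function names are hypothetical and 48 GB is taken as 48 x 10^9 bytes for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Bytes of GPU memory needed for a launch of n photons.
inline std::uint64_t launch_bytes(std::uint64_t n_photons, std::uint64_t bytes_per_photon)
{
    return n_photons * bytes_per_photon;
}

// Largest launch that fits in the given amount of VRAM.
inline std::uint64_t max_photons(std::uint64_t vram_bytes, std::uint64_t bytes_per_photon)
{
    assert(bytes_per_photon > 0);
    return vram_bytes / bytes_per_photon;
}
```

With 112 bytes per photon, 400M photons need 44.8 x 10^9 bytes, which is why a 400M launch only just fits on a 48 GB card.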
JUNO analytic, 400M photons from center | Time | Speedup |
---|---|---|
Geant4 Extrap. | 95,600 s (26 hrs) | |
Opticks RTX OFF (i) | 275 s | 350x |
Opticks RTX ON (i) | 58 s | 1650x |
5x Speedup from RTX with JUNO analytic geometry
100M photon RTX times, avg of 10
Launch times for various geometries
Geometry | Launch (s) | Giga Rays/s | Relative to ana |
---|---|---|---|
JUNO ana | 13.2 | 0.07 | |
JUNO tri.sw | 6.9 | 0.14 | 1.9x |
JUNO tri.hw | 2.2 | 0.45 | 6.0x |
Boxtest ana | 0.59 | 1.7 | |
Boxtest tri.sw | 0.62 | 1.6 | |
Boxtest tri.hw | 0.30 | 3.3 | 1.9x |
JUNO 15k triangles, 132M without instancing
Simple Boxtest geometry gets into ballpark
OptiX Performance Tools and Tricks, David Hart, NVIDIA https://developer.nvidia.com/siggraph/2019/video/sig915-vid
NVIDIA OptiX 7 : Entirely new API
JUNO+Opticks into Production
Geant4+Opticks Integration : Work with Geant4 Collaboration
Alpha Development -> Robust Tool
How is >1500x possible ?
Progress over 30 yrs, Billions of Dollars
Photon Simulation ideally suited to GPU
Three revolutions reinforcing each other:
Deep rivers of development, ripe for re-purposing
Example : DL denoising for faster ray trace convergence
Re-evaluate long held practices in light of new realities:
Highlights
Opticks : state-of-the-art GPU ray tracing applied to optical photon simulation and integrated with Geant4, giving a leap in performance that eliminates memory and time bottlenecks.
- Drastic speedup -> better detector understanding -> greater precision
- any simulation limited by optical photons can benefit
- more photon limited -> more overall speedup (99% -> 100x)
https://bitbucket.org/simoncblyth/opticks | code repository |
https://simoncblyth.bitbucket.io | presentations and videos |
https://groups.io/g/opticks | forum/mailing list archive |
email:opticks+subscribe@groups.io | subscribe to mailing list |
S(n) Expected Speedup
optical photon simulation, P ~ 99% of CPU time
Must consider processing "big picture"
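The expected overall speedup S(n) is the familiar Amdahl's law form: if a fraction P of the time is sped up by a factor n, then S(n) = 1 / ((1 - P) + P/n). The sketch below evaluates it; the function names are hypothetical.

```cpp
#include <cassert>

// Amdahl's law: overall speedup when a fraction P of the work runs n times faster.
inline double amdahl_speedup(double P, double n)
{
    assert(P >= 0.0 && P <= 1.0 && n > 0.0);
    return 1.0 / ((1.0 - P) + P / n);
}

// Limit for n -> infinity: the untouched fraction (1 - P) caps the speedup.
inline double amdahl_limit(double P)
{
    assert(P >= 0.0 && P < 1.0);
    return 1.0 / (1.0 - P);
}
```

With P = 0.99 the limit is 100x however fast the photon simulation becomes, which is why the rest of the processing chain has to be considered as well.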
OpenGL Rasterization Pipeline
GPUs evolved to rasterize 3D graphics at 30/60 fps
Simple Array Data Structures (N-million,4)
Constant "Uniform" 4x4 matrices : scaling+rotation+translation
Graphical Experience Informs Fast Computation on GPUs
Waiting for memory reads/writes is a major source of latency...
Optical Photon Simulation
~perfect match for GPU acceleration
How Many Threads to Launch ?
Understanding Throughput-oriented Architectures https://cacm.acm.org/magazines/2010/11/100622-understanding-throughput-oriented-architectures/fulltext
NVIDIA Titan V: 80 SM, 5120 CUDA cores
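On a throughput-oriented architecture the usual answer is to launch far more threads than there are cores, so that the scheduler can hide memory latency by switching between them. The sketch below shows the standard plain-CUDA-style arithmetic for covering N work items with fixed-size blocks, plus the cores-per-SM figure implied by the Titan V numbers above. The function names are hypothetical, and OptiX launches are sized differently from raw CUDA kernels.

```cpp
#include <cassert>
#include <cstdint>

// Number of thread blocks needed so that every item gets one thread (round up).
inline std::uint64_t blocks_for(std::uint64_t n_items, std::uint64_t threads_per_block)
{
    assert(threads_per_block > 0);
    return (n_items + threads_per_block - 1) / threads_per_block;
}

// CUDA cores per streaming multiprocessor.
inline unsigned cores_per_sm(unsigned cuda_cores, unsigned sm_count)
{
    assert(sm_count > 0);
    return cuda_cores / sm_count;
}
```

For the Titan V this gives 64 cores per SM, while a one-million-photon launch with 256 threads per block needs 3907 blocks, many times more threads than the 5120 cores.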
Array Serialization Benefits
Persist everything to file -> fast development cycle
Can transport everything across network:
Arrays for Everything -> direct access debug
Object-oriented : mixes data and compute
Array-oriented : separate data from compute
NumPy : standard array handling package
https://realpython.com/numpy-array-programming/
CuPy : Simplest CUDA Interface
"Production" CuPy ? Depends on requirements:
C++ Based Interfaces to CUDA
Mature NVIDIA Basis Libraries
RAPIDS : New NVIDIA "Suite" of open source data science libs
Milestones over 50 years of CG
Improving image realism, speed
Image rendering : Applied photon simulation
(2018) NVIDIA RTX
Project Sol : NVIDIA RTX Demo
real-time cinematic raytracing on a single GPU (NVIDIA RTX 2080 Ti)
Monte Carlo Path Tracing
Movies ≈ Monte Carlo optical photon simulations
NVIDIA OptiX AI Denoiser
https://research.nvidia.com/publication/interactive-reconstruction-monte-carlo-image-sequences-using-recurrent-denoising
Free Online Book
CG Rendering "Simulation" | Particle Physics Simulation |
---|---|
simulates: image formation, vision | simulates photons: generation, propagation, detection |
(red, green, blue) | wavelength range eg 400-700 nm |
ignore polarization | polarization vector propagated throughout |
participating media: clouds, fog, fire [1] | bulk scattering: Rayleigh, Mie |
human exposure times | nanosecond time scales |
equilibrium assumption | transient phenomena |
ignores light speed, time | arrival time crucial, speed of light : 30 cm/ns |
Despite differences many techniques+hardware+software directly applicable to physics eg:
Potentially Useful CG techniques for "billion photon simulations"
[1] search for: "Volumetric Rendering Equation"
Re-usable photon "snapshots" ?
GPU "snapshot" cache data structure:
Opticks as drop in fast replacement for Geant4
Full+fast GPU accelerated simulation:
Re-use is a caching optimization; full propagation is still needed:
#include <curand_kernel.h>   // curandState, curand_uniform

// Photon, uniform_sphere and rotateUz are defined elsewhere in Opticks (see the rayleigh.h link below).
__device__ void rayleigh_scatter(Photon &p, curandState &rng)
{
    float3 newDirection, newPolarization ;
    float cosTheta ;

    do {
        // sample an isotropic direction, then rotate it relative to the current photon direction
        newDirection = uniform_sphere(&rng);
        rotateUz(newDirection, p.direction );

        // subtract the component of the old polarization along the new direction
        float constant = -dot(newDirection,p.polarization);
        newPolarization = p.polarization + constant*newDirection ;

        // newPolarization
        // 1. transverse to newDirection (as that component is subtracted)
        // 2. same plane as old p.polarization and newDirection (by construction)
        //
        // ... corner case elided ...

        // pick one of the two equivalent polarization orientations at random
        if(curand_uniform(&rng) < 0.5f) newPolarization = -newPolarization ;

        newPolarization = normalize(newPolarization);
        cosTheta = dot(newPolarization,p.polarization) ;

        // rejection sampling: accept with probability cos^2 of the angle between old and new polarization
    } while ( cosTheta*cosTheta < curand_uniform(&rng)) ;

    p.direction = newDirection ;
    p.polarization = newPolarization ;
}
The polarization vector must be persisted to truly resume a propagation
https://bitbucket.org/simoncblyth/opticks/src/master/optixrap/cu/rayleigh.h
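The key geometric step in the Rayleigh scatter is projecting the old polarization onto the plane transverse to the new direction. The host-side C++ sketch below isolates that step so it can be checked without a GPU. `Vec3` and `transverse_polarization` are hypothetical names. The degenerate case where the new direction is parallel to the old polarization is the corner case elided in the listing, and here it simply trips an assertion.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

inline float dot(const Vec3& a, const Vec3& b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

inline Vec3 normalize(const Vec3& v)
{
    float n = std::sqrt(dot(v, v));
    assert(n > 0.f);   // zero length means the elided corner case was hit
    return Vec3{ v.x / n, v.y / n, v.z / n };
}

// Remove the component of the polarization along newDirection (a unit vector),
// leaving a unit vector transverse to the new direction of travel.
inline Vec3 transverse_polarization(const Vec3& newDirection, const Vec3& polarization)
{
    float c = -dot(newDirection, polarization);
    Vec3 p{ polarization.x + c * newDirection.x,
            polarization.y + c * newDirection.y,
            polarization.z + c * newDirection.z };
    return normalize(p);
}
```

Since the result depends on the incoming polarization as well as the new direction, a stored photon "snapshot" that omits polarization cannot reproduce the same scatter.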
Virtual shell OR scatter-based ?
What is VRAM of available GPUs ?
Literature Search/Learning
Gain Experience
-> informed decisions
Where/when/what to collect ?
Too many options: experimentation needed to iterate towards solution
[1] RTX Beyond Ray Tracing: Exploring the Use of Hardware Ray Tracing Cores for Tet-Mesh Point Location https://www.willusher.io/publications/rtx-points
Highlights
Opticks : state-of-the-art GPU ray tracing applied to optical photon simulation and integrated with Geant4, eliminating memory and time bottlenecks.
- neutrino telescope simulation can benefit drastically from Opticks
- Drastic speedup -> better detector understanding -> greater precision
- more photon limited -> more overall speedup ( 99.9% -> 1000x )
- graphics : rich source of techniques, inspiration, CUDA code to try