Opticks is an open source project that applies state-of-the-art GPU ray tracing from NVIDIA OptiX to optical photon simulation and integrates it with Geant4. This results in drastic speedups of more than 1500× over single-threaded Geant4.
Any simulation limited by optical photons can remove those limits by using Opticks.
This render shows the photons resulting from a muon crossing the JUNO scintillator; each line represents a single photon.
The number of photons across track lengths of up to 35 m in the scintillator is about 70 million.
This will be a talk of two halves:
JUNO will be the world's largest liquid scintillator detector, with a spherical 20,000 ton volume of scintillator surrounded by a water pool buffer which also provides water Cherenkov detection.
The scintillator is instrumented with 18 thousand 20-inch PMTs and 25 thousand 3-inch PMTs.
JUNO will be able to detect neutrinos from many terrestrial and extra-terrestrial sources, including solar, atmospheric, and geo-neutrinos.
Despite 700 m of overburden, the largest backgrounds to these neutrino signals will be from cosmic muon induced processes.
A muon veto is used to control the backgrounds.
However, to minimize the time and volume vetoed, good muon reconstruction is necessary, which means we need large samples of cosmic muons.
Geant4 is the standard toolkit used to simulate detectors across several fields.
Standard Simulation Tool of HEP
Geant4 simulates particles travelling through matter
Geant4 Approach
Very General and Capable Tool
Geant4 is a very general tool, but most of it is not needed for the simulation of optical photons.
Huge CPU Memory+Time Expense
Muons travelling across the liquid scintillator yield many tens of millions of optical photons. This is a huge memory and time challenge for Geant4 Monte Carlo production.
Most of the CPU time is spent finding intersections between photons and geometry. Fortunately, simulation is not alone in facing this bottleneck.
Optical photons are naturally parallel, with only Cherenkov and scintillation production being relevant, and we are only interested in photons collected at PMTs.
These characteristics make it straightforward to integrate an external optical simulation.
Not a Photo, a Calculation
Much in common : geometry, light sources, optical physics
Many Applications of ray tracing :
Optical photon simulation and ray traced image rendering have a lot in common.
With simulation you want photon parameters at the PMTs; with rendering you want pixel values at the image plane.
Both are limited by ray geometry intersection, which is also known as ray tracing.
Ray tracing is used across many industries, which means there have been huge efforts across decades to improve ray tracing performance.
It is good to clarify the difference between the two primary graphics rendering techniques.
Rasterization is the most common rendering technique
Ray tracing
Ray tracing is an overloaded term. In some contexts it means just the ray transport from an origin to an intersection, but it also refers more generally to a specific rendering technique, and even more generally to a class of rendering techniques.
| 10 Giga Rays/s |
Two years ago NVIDIA announced a leap in ray tracing performance with the Quadro RTX GPU, which it calls the world's first ray tracing GPU, as it has hardware dedicated to accelerating ray tracing.
NVIDIA claims it can reach 10 billion ray geometry intersections per second with a single GPU.
Assuming each simulated photon costs 10 rays, that puts the upper limit at 1 billion photons/second per GPU.
Offload Ray Trace to Dedicated HW
SM : Streaming Multiprocessor
BVH : Bounding Volume Hierarchy
The performance jump is achieved by offloading ray tracing from the general purpose SM (streaming multiprocessor) to the fixed function RT core, which frees up the SM.
RTX Platform : Hybrid Rendering
-> real-time photoreal cinematic 3D rendering
To gamers and movie makers these RTX GPUs enable:
RTX is hybrid rendering using:
Tree of Bounding Boxes (bbox)
The principal technique to accelerate ray geometry intersection is an acceleration structure called a bounding volume hierarchy (BVH).
This divides space into progressively smaller boxes, forming a spatial index.
Traversing the tree of bounds minimizes the number of tests needed to find an intersect.
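As a sketch of the idea (illustrative Python, not Opticks code), a BVH traversal only descends into children whose bounding boxes the ray actually enters, so most of the geometry is never tested:

```python
# Minimal BVH traversal sketch: a ray only tests geometry whose
# bounding box it hits, pruning whole subtrees otherwise.

def ray_hits_aabb(origin, direction, lo, hi):
    """Slab test: does the ray intersect the axis-aligned box [lo, hi]?"""
    tmin, tmax = 0.0, float("inf")
    for o, d, l, h in zip(origin, direction, lo, hi):
        if abs(d) < 1e-12:                 # ray parallel to this slab
            if o < l or o > h:
                return False
        else:
            t0, t1 = (l - o) / d, (h - o) / d
            if t0 > t1:
                t0, t1 = t1, t0
            tmin, tmax = max(tmin, t0), min(tmax, t1)
    return tmin <= tmax

def traverse(node, origin, direction, hits):
    """Recursively visit only nodes whose bounds the ray enters."""
    if not ray_hits_aabb(origin, direction, node["lo"], node["hi"]):
        return                             # prune the whole subtree
    if "prim" in node:                     # leaf: record candidate primitive
        hits.append(node["prim"])
    else:
        for child in node["children"]:
            traverse(child, origin, direction, hits)

# Two unit boxes, one each side of the origin, under a root box.
tree = {"lo": (-10, -1, -1), "hi": (10, 1, 1), "children": [
    {"lo": (-10, -1, -1), "hi": (-8, 1, 1), "prim": "left"},
    {"lo": (8, -1, -1), "hi": (10, 1, 1), "prim": "right"},
]}

hits = []
traverse(tree, (0, 0, 0), (1, 0, 0), hits)   # ray along +x: only "right" tested
```

A real BVH is deeper and is built automatically; the point is only that the ray along +x never reaches the left leaf.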
With some geometry it is possible for the traversal to be done on the dedicated RT cores.
OptiX Raytracing Pipeline
Analogous to OpenGL rasterization pipeline:
OptiX makes GPU ray tracing accessible
NVIDIA expertise:
Opticks provides (Yellow):
[1] Turing RTX GPUs
OptiX creates GPU kernels by compiling together:
Most of Opticks on the GPU is in the ray generation and intersection programs.
SMALL So: how can an external optical photon simulation be integrated with Geant4?
In the standard workflow the Geant4 Scintillation and Cerenkov processes calculate a number of photons and then loop, generating these and collecting them as secondaries.
In the hybrid workflow this generation is split between the CPU and GPU, with "Gensteps" acting as the bridge. Genstep parameters include the number of photons, positions, and everything else needed in the generation loop.
The result is a very simple port of the generation loop to the GPU.
It is doubly helpful to generate photons on the GPU, as then they take no CPU memory.
So photon memory can be entirely offloaded to the GPU, with only hits needing CPU memory.
This also keeps overheads low, as gensteps are typically a factor of 100 smaller than the photons.
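A minimal sketch of the genstep idea in Python, with invented field names (the real Opticks genstep layout differs): the CPU records one small record per Geant4 step, and the generation loop runs later on the GPU, one thread per photon (serial here):

```python
# Illustrative only: field names are assumptions, not the real Opticks
# Genstep layout.

import random

def collect_genstep(num_photons, position, kind):
    # CPU side: a genstep is ~100x smaller than the photons it describes
    return {"n": num_photons, "pos": position, "kind": kind}

def generate_photons(genstep, seed):
    # GPU side: in CUDA this loop body runs with one thread per photon
    rng = random.Random(seed)
    return [{"pos": genstep["pos"],                    # photons start on the step
             "wavelength": rng.uniform(400.0, 700.0)}  # nm, stand-in spectrum
            for _ in range(genstep["n"])]

gensteps = [collect_genstep(3, (0.0, 0.0, 0.0), "scintillation"),
            collect_genstep(2, (1.0, 0.0, 0.0), "cerenkov")]
photons = [p for i, gs in enumerate(gensteps)
             for p in generate_photons(gs, i)]
```

Only the small `gensteps` list crosses the CPU/GPU boundary; the `photons` list never needs to exist in CPU memory.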
The geometry is also needed on the GPU, with all material and surface properties.
GPU Resident Photons
Thrust: high level C++ access to CUDA
OptiX : single-ray programming model -> line-by-line translation
This repeats what I just explained on the diagram
Sphere, Cylinder, Disc, Cone, Convex Polyhedron, Hyperboloid, Torus, ...
Geometry starts from primitive shapes.
NVIDIA OptiX does not provide primitives: Opticks implements ray geometry intersection for these shapes with polynomial root finding.
Outside/Inside Unions
dot(normal,rayDir) -> Enter/Exit
Complete Binary Tree, pick between pairs of nearest intersects:
| UNION tA < tB | Enter B | Exit B | Miss B |
|---|---|---|---|
| Enter A | ReturnA | LoopA | ReturnA |
| Exit A | ReturnA | ReturnB | ReturnA |
| Miss A | ReturnB | ReturnB | ReturnMiss |
Opticks includes a CUDA CSG (constructive solid geometry) implementation beneath the level of OptiX primitives, so rays can be intersected with complex compound shapes.
G4Boolean trees can be translated into Opticks without any approximation.
Materials/Surfaces -> GPU Texture
Material/Surface/Scintillator properties
Material/surface boundary : 4 indices
Primitives labelled with unique boundary index
simple/fast properties + reemission wavelength
G4 Structure Tree -> Instance+Global Arrays -> OptiX
Group structure into repeated instances + global remainder:
instancing -> huge memory savings for JUNO PMTs
The Opticks geometry model starts from the observation that there is lots of repetition in the geometry. It automatically finds these repeats.
Bringing optical physics to the GPU was straightforward, because a direct translation could be used.
The Geant4 geometry model is vastly different from what is needed on the GPU, making geometry translation the most challenging aspect of Opticks.
And everything needs to be serialized to be copied to the GPU.
The upshot is that full Geant4 detector geometries can be automatically translated into NVIDIA OptiX geometries.
This is an OptiX ray trace image from the chimney region at the top of the JUNO scintillator sphere.
This is an OpenGL rasterized image, using the approximate triangulated geometry.
Opticks manages analytic and triangulated geometry together.
Random Aligned Bi-Simulation
Same inputs to Opticks and Geant4:
Common recording into OpticksEvents:
Aligned random consumption, direct comparison:
Bi-simulations of all JUNO solids, with millions of photons
Primary sources of problems
Primary cause : float vs double
Geant4 uses double everywhere; Opticks uses it only sparingly (double was observed to cost a 10× slowdown with RTX).
Conclude
Opticks is validated by comparison with Geant4.
Random aligned bi-simulation (sidebar) allows direct step-by-step comparison of simulations, unclouded by statistical variation.
So issues show up very clearly.
Comparing individual solids shows discrepancies at the fraction of a percent level.
Main cause is float vs double.
This GUI allows interactive selection between tens of millions of photons based on their histories.
Here it shows the photons that scattered before transmitting straight through a boundary to surface detection.
It is implemented by indexing the photon histories using very fast GPU big-integer sorting provided by CUDA Thrust, and using OpenGL geometry shaders to switch between selections.
The 64-bit integers hold up to 16 4-bit flags for each step of the photon.
All of this uses the interop capabilities of OpenGL/CUDA/Thrust and OptiX, so GPU buffers can be written to and rendered in place with no copying around.
Compression Essential
Domain compression to fit in VRAM
4-bit History Flags at Each Step
BT : boundary BR : boundary reflect SC : bulk scatter AB : bulk absorb SD : surface detect SA : surface absorb
Up to 16 steps of the photon propagation are recorded.
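A sketch of the packing in Python. The BT/BR/SC/AB/SD/SA labels are from the slide; the numeric flag values, and the SI/CK labels for scintillation/Cherenkov generation, are assumptions for illustration:

```python
# Up to sixteen 4-bit step flags packed into one 64-bit integer.
# Flag values below are invented, not the real Opticks assignments.

FLAGS = {"SI": 0x1, "CK": 0x2, "BT": 0x3, "BR": 0x4,
         "SC": 0x5, "AB": 0x6, "SD": 0x7, "SA": 0x8}
NAMES = {v: k for k, v in FLAGS.items()}

def pack_history(steps):
    seq = 0
    for slot, flag in enumerate(steps[:16]):   # at most 16 steps fit in 64 bits
        seq |= FLAGS[flag] << (4 * slot)       # 4 bits per step
    return seq

def unpack_history(seq):
    steps = []
    while seq:
        steps.append(NAMES[seq & 0xF])
        seq >>= 4
    return steps

seq = pack_history(["SI", "SC", "BT", "SD"])   # generate, scatter, transmit, detect
assert unpack_history(seq) == ["SI", "SC", "BT", "SD"]
assert seq < (1 << 64)                         # fits one 64-bit integer
```

Histories packed this way sort as plain integers, which is what makes the Thrust-based GUI selection so fast.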
Photon Array : 4 * float4 = 512 bits/photon
Step Record Array : 2 * short4 = 2*16*4 = 128 bits/record
Compression uses known domains of position (geometry center, extent), time (0:200ns), wavelength, polarization.
When you have millions of photons it is important to consider compression techniques.
I mention this detail, because compression of photons is essential when considering how to make propagations re-usable.
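A minimal sketch of the domain compression for one position coordinate, mapping a known domain onto signed 16-bit integers. The 20 m half-extent is illustrative; the real record compresses position, time (0:200ns), wavelength and polarization in the same spirit:

```python
# Domain compression sketch: store a float coordinate as a 16-bit short
# once its domain (here: geometry center +/- extent) is known.

def compress(value, lo, hi):
    """Map value in [lo, hi] onto the signed 16-bit range."""
    f = (value - lo) / (hi - lo)               # 0..1 across the domain
    return int(round(-32768 + f * 65535))

def decompress(code, lo, hi):
    f = (code + 32768) / 65535.0
    return lo + f * (hi - lo)

extent = 20000.0                               # mm, illustrative half-extent
x = 12345.6
code = compress(x, -extent, extent)
step = 2 * extent / 65535                      # quantization step ~0.6 mm
assert -32768 <= code <= 32767                 # fits in a short
assert abs(decompress(code, -extent, extent) - x) <= step
```

The cost is a fixed quantization error of one step across the domain; the gain is a 4× reduction versus storing doubles.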
The GUI also provides interactive time scrubbing of the propagation of tens of millions of photons.
This is some nanoseconds later for a different history category.
I created this GUI to help with debugging the simulation.
Test Hardware + Software
Workstation
Software
IHEP GPU Cluster
Full JUNO Analytic Geometry j1808v5
Production Mode : does the minimum
Multi-Event Running, Measure:
Emitting millions of photons from the center of the scintillator and timing the interval and launch times of the propagation provides a measure of the performance of a geometry.
By interval I mean the time between successive launches, so this covers all the overheads of copying the gensteps to the GPU and pulling the hits back to the CPU.
Overheads are less than 10%
The GPU used for these tests is the Quadro RTX 8000 with 48GB VRAM.
Xie-xie (thank you) to NVIDIA China for loaning the card.
Photon Launch Size : VRAM Limited
NVIDIA Quadro RTX 8000 (48 GB)
400M photons × 112 bytes ~ 45 GB
The first check is that you get the expected number of hits as a function of the number of photons.
The photon parameters take 64 bytes and curandState takes 48 bytes.
That is 112 bytes per photon, so the limit on the number of photons that can be simulated in a single launch with this 48 GB GPU is a bit more than 400M.
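The arithmetic behind that limit:

```python
# Launch-size limit: 64 bytes of photon parameters plus 48 bytes of
# curandState per photon, against the 48 GB of a Quadro RTX 8000.

photon_bytes = 64 + 48                 # = 112 bytes per photon
vram_bytes = 48 * 10**9                # Quadro RTX 8000

max_photons = vram_bytes // photon_bytes
print(max_photons)                     # 428571428 : a bit more than 400M
```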
| JUNO analytic, 400M photons from center | Time | Speedup |
|---|---|---|
| Geant4 Extrap. | 95,600 s (26 hrs) | |
| Opticks RTX ON (i) | 58 s | 1650x |
This compares the extrapolated Geant4 propagation time with the Opticks launch interval with RTX on. The speedup is more than a factor of 1500; a log scale is needed to make both visible.
For 400M photons, Geant4 takes more than a day, Opticks takes less than a minute.
This is with analytic geometry. Speedup is a lot more with triangles.
| JUNO analytic, 400M photons from center | Time | Speedup |
|---|---|---|
| Opticks RTX ON (i) | 58s | 1650x |
| Opticks RTX OFF (i) | 275s | 350x |
| Geant4 Extrap. | 95,600s (26 hrs) | |
This is the same information shown as a ratio.
| 5x Speedup from RTX with JUNO analytic geometry |
Comparing RTX mode OFF to ON shows that the dedicated ray tracing hardware is giving a factor of 5.
100M photon RTX times, avg of 10
| Launch times for various geometries | |||
|---|---|---|---|
| Geometry | Launch (s) | Giga Rays/s | Relative to ana |
| JUNO ana | 13.2 | 0.07 | |
| JUNO tri.sw | 6.9 | 0.14 | 1.9x |
| JUNO tri.hw | 2.2 | 0.45 | 6.0x |
| Boxtest ana | 0.59 | 1.7 | |
| Boxtest tri.sw | 0.62 | 1.6 | |
| Boxtest tri.hw | 0.30 | 3.3 | 1.9x |
JUNO 15k triangles, 132M without instancing
Simple Boxtest geometry gets into ballpark
OptiX Performance Tools and Tricks, David Hart, NVIDIA https://developer.nvidia.com/siggraph/2019/video/sig915-vid
NVIDIA claims 10 GigaRays/s
As each photon costs around 10 rays that means 1 billion photons per second is the upper limit.
The performance you get is very sensitive to the geometry, both its complexity and how you model it, because these result in different BVHs.
And its also necessary to consider what can run in the RT cores.
The large dependence on geometry makes me hopeful that there is room for improvement by tuning the geometry modelling.
NVIDIA OptiX 7 : Entirely new API
JUNO+Opticks into Production
Geant4+Opticks Integration : Work with Geant4 Collaboration
Alpha Development ------>-----------------> Robust Tool
The next step is bringing Opticks into production usage within JUNO
There is considerable interest in Opticks by the Geant4 collaboration. The Fermilab Geant4 group is working on making an extended example for inclusion with the Geant4 distribution. The CERN Geant4 group is looking at the possibilities to use the Opticks geometry approach more widely, eg for gamma simulation in LHC calorimeters.
Opticks needs many more users and developers to turn it into a robust tool.
There is also a challenge in the form of NVIDIA OptiX 7, which has drastically changed the API; an important multi-GPU feature is going away.
Regaining this requires developing load balancing across multiple GPUs myself.
How is >1500x possible ?
Progress over 30 yrs, Billions of Dollars
Photon Simulation ideally suited to GPU
Three revolutions reinforcing each other:
Deep rivers of development, ripe for re-purposing
Example : DL denoising for faster ray trace convergence
Re-evaluate long held practices in light of new realities:
How is a >1500× speedup possible?
Well, I think the success of Geant4 over 20 years has made it too easy to just continue using it and hope that Moore's law will speed things up.
Meanwhile, billions of dollars of industry development have gone into improving ray tracing.
Liberating geometry from the Geant4 object model allows all this development effort to be applied to optical photon simulation.
Highlights
Opticks : state-of-the-art GPU ray tracing applied to optical photon simulation and integrated with Geant4, giving a leap in performance that eliminates memory and time bottlenecks.
- Drastic speedup -> better detector understanding -> greater precision
- any simulation limited by optical photons can benefit
- more photon limited -> more overall speedup (99% -> 100x)
| https://bitbucket.org/simoncblyth/opticks | code repository |
| https://simoncblyth.bitbucket.io | presentations and videos |
| https://groups.io/g/opticks | forum/mailing list archive |
| email:opticks+subscribe@groups.io | subscribe to mailing list |
Here is the summary of the first half.
Opticks applies the best available GPU ray tracing to optical photon simulation, resulting in speedups exceeding three orders of magnitude.
Opticks is still very young and it really needs users (and developers) to turn it into a robust tool that anyone with an optical photon simulation problem can use to eliminate those limits.
These speedups are just for the optical photons; how much that helps overall depends on how limited you are by optical photons.
This is a 360 degree view of all the JUNO central detector PMTs, which I used as a ray tracing benchmark.
It is an equirectangular projection.
Here's the outline of the 2nd half.
S(n) Expected Speedup
optical photon simulation, P ~ 99% of CPU time
Must consider processing "big picture"
The serial portion of processing determines the maximum overall speedup, because the time spent in the parallel portion goes to zero as it is accelerated.
With neutrino telescopes I expect the situation will be more like 99.9% limited, so there is potential for some really drastic overall speedups.
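The expected speedup S(n) follows Amdahl's law; a quick check of the 99% → 100× and 99.9% → 1000× figures:

```python
# Amdahl's law: if a fraction P of the run is parallelizable (the
# optical photon simulation) and that part is sped up by a factor n,
# the overall speedup is S(n) = 1 / ((1 - P) + P / n).

def amdahl(P, n):
    return 1.0 / ((1.0 - P) + P / n)

# Even an "infinite" photon speedup is capped by the serial remainder:
assert round(amdahl(0.99, 10**12)) == 100      # 99% limited   -> 100x cap
assert round(amdahl(0.999, 10**12)) == 1000    # 99.9% limited -> 1000x cap
```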
OpenGL Rasterization Pipeline
GPUs evolved to rasterize 3D graphics at 30/60 fps
Simple Array Data Structures (N-million,4)
Constant "Uniform" 4x4 matrices : scaling+rotation+translation
Graphical Experience Informs Fast Computation on GPUs
Rasterization is the process of going from input 3D vertices, which are collections of 4 floats, to 2D pixel values.
GPUs evolved to do this rasterization.
When using GPUs you should keep these origins in mind.
Waiting for memory reads/writes is a major source of latency...
Latency hiding works using hardware multi-threading: when one group of threads is blocked, waiting to read from global memory for example, other groups of threads are resumed.
This is only effective at hiding latency when there are enough other threads in flight at the same time.
If your processing is not parallel enough, you will not make effective use of the GPU
Optical Photon Simulation
~perfect match for GPU acceleration
How Many Threads to Launch ?
Understanding Throughput-oriented Architectures https://cacm.acm.org/magazines/2010/11/100622-understanding-throughput-oriented-architectures/fulltext
NVIDIA Titan V: 80 SM, 5120 CUDA cores
The main things that dictate how effective your use of the GPU is going to be ...
Many people guess that you should launch only as many threads as there are cores in the GPU, but that's wrong: you need to launch large multiples of that to get the best performance.
The reason is latency hiding, which only works when there is abundant parallelism; when the GPU workload resembles that of 3D graphics you will get the best performance.
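A back-of-envelope for the Titan V figures above, taking 2048 resident threads per SM as an assumed hardware limit for that generation, used only to show the scale of oversubscription involved:

```python
# How many threads can be in flight versus how many cores exist.
# threads_per_sm is an assumption, not a quoted figure.

sms = 80                               # Titan V streaming multiprocessors
cuda_cores = 5120
threads_per_sm = 2048                  # assumed max resident threads per SM

in_flight = sms * threads_per_sm
print(in_flight)                       # 163840 threads can be resident
print(in_flight // cuda_cores)         # 32x more threads than cores
```

Launching far more threads than cores gives the scheduler other warps to run while some wait on memory.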
Array Serialization Benefits
Persist everything to file -> fast development cycle
Can transport everything across network:
Arrays for Everything -> direct access debug
Object-oriented : mixes data and compute
Array-oriented : separate data from compute
NumPy : standard array handling package
The simplicity requirement comes from the need to do many things in parallel, and also from the need to serialize everything in order to copy it to GPU global memory.
This constraint means arrays must be used.
But that comes with advantages:
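A sketch of what the array-oriented style buys, using NumPy and the 4 × float4 = 64 bytes/photon layout quoted earlier (an in-memory buffer stands in for a file; the fill pattern is illustrative):

```python
# Array-oriented workflow: every event quantity is a plain NumPy array,
# so persisting, transporting and debugging are all one-liners.

import io
import numpy as np

photons = np.zeros((1000, 4, 4), dtype=np.float32)   # (N, 4, float4)
photons[:, 0, :3] = np.random.default_rng(0).uniform(-1, 1, (1000, 3))

buf = io.BytesIO()
np.save(buf, photons)          # persist everything: one call
buf.seek(0)
loaded = np.load(buf)          # direct-access debug: it is just an array

assert (loaded == photons).all()
assert loaded.nbytes == 1000 * 64            # 64 bytes per photon
```

The same flat buffer is what gets copied to GPU global memory, which is why separating data from compute pays twice.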
CuPy : Simplest CUDA Interface
"Production" CuPy ? Depends on requirements:
C++ Based Interfaces to CUDA
Mature NVIDIA Basis Libraries
RAPIDS : New NVIDIA "Suite" of open source data science libs
This slide is here just to highlight some projects that provide a quick start at using CUDA.
CuPy is quite a new project you might not have heard of: it implements the NumPy API on top of the NVIDIA CUDA stack, and is a great way to get started with CUDA.
Graphics rendering has come a long way over the past 50 years, starting from the first use of ray casting in 1968.
This illustrates some of the milestones in graphics rendering.
Milestones over 50 years of CG
Improving image realism, speed
Image rendering : Applied photon simulation
(2018) NVIDIA RTX
Project Sol : NVIDIA RTX Demo
real-time cinematic raytracing on single GPU ( NVIDIA RTX 2080Ti)
Path tracing is short for Monte Carlo path tracing.
This is the abstract from a course held at the SIGGRAPH graphics conference
Monte Carlo Path Tracing
Movies ≈ Monte Carlo optical photon simulations
For many movies:
SMALL Computer graphics is all about finding solutions to the rendering equation.
The equation stems from energy conservation in equilibrium:
The lines on the images try to illustrate that:
You can see blue, green and red casts on the teapot coming from color bleeding, which arises from multiple bounces, i.e. indirect lighting.
The rendering equation is a recursive integral equation
If you can solve it you can construct images from any viewpoint.
This form of recursive integral equation has a solution which is a sum of terms with progressively increasing bounces.
Top row shows individual contributions, with the cumulative sum on the bottom row.
Recursive integral eqn -> sum of integrals
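In the usual notation (Kajiya's rendering equation; the symbols here are the standard graphics ones, not taken from the slides), the recursion and its expansion are:

```latex
% Rendering equation: outgoing radiance = emitted + reflected incoming
L_o(x,\omega_o) = L_e(x,\omega_o)
  + \int_{\Omega} f_r(x,\omega_i,\omega_o)\, L_i(x,\omega_i)\,
    (\omega_i \cdot n)\, \mathrm{d}\omega_i

% Abbreviating the integral as a transport operator T, so L = L_e + T L,
% unrolling the recursion gives a sum over bounce counts:
L = \sum_{k=0}^{\infty} T^{k} L_e = L_e + T L_e + T^{2} L_e + \cdots
```

Each term $T^k L_e$ is exactly the "k bounces" contribution shown in the top row of the figure.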
The "Volumetric Rendering Equation" is a generalization to cover participating media eg clouds, fire, fog, skin
Monte Carlo Path Tracing
Monte Carlo Path Tracing
The technique has the usual Monte Carlo disadvantage of slow convergence. Much of computer graphics is about devising ways to bias the sampling and speed up convergence.
NVIDIA OptiX includes a new AI based approach to removing noise.
NVIDIA OptiX AI Denoiser
https://research.nvidia.com/publication/interactive-reconstruction-monte-carlo-image-sequences-using-recurrent-denoising
This benefits from hardware dedicated to AI inferencing, the Tensor Cores.
Free Online Book
I hope I have got some of you interested in learning more about graphics. A good and very extensive book on physically based rendering is available for free at this URL.
This book won an Oscar for its usefulness to the film industry.
| CG Rendering "Simulation" | Particle Physics Simulation |
|---|---|
| simulates: image formation, vision | simulates photons: generation, propagation, detection |
| (red, green, blue) | wavelength range eg 400-700 nm |
| ignore polarization | polarization vector propagated throughout |
| participating media: clouds, fog, fire [1] | bulk scattering: Rayleigh, Mie |
| human exposure times | nanosecond time scales |
| equilibrium assumption | transient phenomena |
| ignores light speed, time | arrival time crucial, speed of light : 30 cm/ns |
Despite differences many techniques+hardware+software directly applicable to physics eg:
Potentially Useful CG techniques for "billion photon simulations"
[1] search for: "Volumetric Rendering Equation"
There are great similarities between graphics rendering and physics simulation but also some clear differences.
Re-usable photon "snapshots" ?
GPU "snapshot" cache data structure:
Opticks as drop in fast replacement for Geant4
Full+fast GPU accelerated simulation:
Re-usage is caching optimization, still need full propagation:
SMALL 1st: full propagation
2nd: full propagation with "snapshots" for re-use (sidebar)
DISADVANTAGE:
__device__ void rayleigh_scatter(Photon &p, curandState &rng)
{
    float3 newDirection, newPolarization ;
    float cosTheta ;

    do {
        newDirection = uniform_sphere(&rng);
        rotateUz(newDirection, p.direction );

        float constant = -dot(newDirection, p.polarization);
        newPolarization = p.polarization + constant*newDirection ;

        // newPolarization
        // 1. transverse to newDirection (as that component is subtracted)
        // 2. same plane as old p.polarization and newDirection (by construction)

        // ... corner case elided ...

        if (curand_uniform(&rng) < 0.5f) newPolarization = -newPolarization ;

        newPolarization = normalize(newPolarization);
        cosTheta = dot(newPolarization, p.polarization) ;

    } while ( cosTheta*cosTheta < curand_uniform(&rng) ) ;

    p.direction = newDirection ;
    p.polarization = newPolarization ;
}
Have to persist the polarization vector, to truly resume a propagation
https://bitbucket.org/simoncblyth/opticks/src/master/optixrap/cu/rayleigh.h
SMALL The polarization vector has to be persisted to truly resume a propagation, as the angle between polarization vectors before and after a scatter follows a cos² distribution.
Not doing so would prevent strict validation.
Collecting photons at scatters duplicates less. The post-scatter direction and polarization are generated from the pre-scatter values, so it is better to store the pre-scatter values.
Virtual shell OR scatter-based ?
What is VRAM of available GPUs ?
Literature Search/Learning
Gain Experience
-> informed decisions
Where/when/what to collect ?
Too many options: experimentation needed to iterate towards solution
[1] RTX Beyond Ray Tracing: Exploring the Use of Hardware Ray Tracing Cores for Tet-Mesh Point Location https://www.willusher.io/publications/rtx-points
Too many open questions: experimentation needed to iterate towards solution
The graphics technique of photon mapping is an old one from around 2001. It collects photons into a kd-tree, a space partitioning data structure designed to speed up spatial queries.
Balancing the tree has the advantage that you can then easily serialize it and traverse it just by bit manipulations on the array index.
This is just one example of spatial data structures; searches might reveal some CUDA implementation that would lead me to try another approach.
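A sketch of the bit-manipulation traversal mentioned above: in a complete binary tree serialized in level order, parent/child links are pure index arithmetic, so no pointers need to be stored or chased (illustrative Python):

```python
# Heap-style layout of a balanced (complete) binary tree: the structure
# is implicit in the array indices themselves.

def parent(i): return (i - 1) >> 1
def left(i):   return 2 * i + 1
def right(i):  return 2 * i + 2
def depth(i):  return (i + 1).bit_length() - 1   # root at depth 0

# Seven nodes of a balanced tree in one flat array:
nodes = ["root", "L", "R", "LL", "LR", "RL", "RR"]
assert nodes[left(0)] == "L"
assert nodes[right(left(0))] == "LR"
assert parent(5) == 2 and depth(6) == 2
```

This is why balancing pays: the flat array serializes directly to the GPU, and traversal needs only shifts and adds.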
To learn more about photon mapping, this is the classic book on the subject. The lower link is a 200 page course on photon mapping.
Highlights
Opticks : state-of-the-art GPU ray tracing applied to optical photon simulation and integrated with Geant4, eliminating memory and time bottlenecks.
- neutrino telescope simulation can benefit drastically from Opticks
- Drastic speedup -> better detector understanding -> greater precision
- more photon limited -> more overall speedup ( 99.9% -> 1000x )
- graphics : rich source of techniques, inspiration, CUDA code to try
| https://bitbucket.org/simoncblyth/opticks | code repository |
| https://simoncblyth.bitbucket.io | presentations and videos |
| https://groups.io/g/opticks | forum/mailing list archive |
| email:opticks+subscribe@groups.io | subscribe to mailing list |
So in summary : Opticks applies the best available GPU ray tracing to optical photon simulation resulting in speedups exceeding three orders of magnitude.
Neutrino telescope simulation is so extremely limited by the photons that there is potential for large overall speedups.
The field of graphics has lots of potential to provide techniques and code for working with really large photon propagations.