Opticks Experience of GPU Optical Photon Simulation with NVIDIA OptiX

Opticks GPU Optical Simulation with NVIDIA® OptiX™ - Development Experience : Problems and Successes

Open source, https://bitbucket.org/simoncblyth/opticks

Simon C Blyth, IHEP, CAS — HSF Simulation Working Group Meeting, 27 May 2020

Outline

JUNO Optical Photon Simulation Problem...
GPU architecture and Ray tracing
- CPU vs GPU architectures, Latency vs Throughput
- Understanding GPU Graphical Origins -> Effective GPU Computation
- Optical Photon Simulation ≈ Ray Traced Image Rendering
- Rasterization and Ray tracing
- Turing Built for RTX, BVH : Bounding Volume Hierarchy
- NVIDIA OptiX Ray Tracing Engine
Opticks : Translate Geant4 Context to GPU
- Geant4 + Opticks Hybrid Workflow : External Optical Photon Simulation
- Opticks : Translates G4 Optical Physics to CUDA/OptiX
- G4Solid -> CUDA Intersect Functions for ~10 Primitives
- G4Boolean -> CUDA/OptiX Intersection Program Implementing CSG
- Opticks : Translates G4 Geometry to GPU, Without Approximation
Validation and Performance
- Validation of Opticks Simulation by Comparison with Geant4
- Performance Scanning from 1M to 400M Photons
Opticks Experience : Problems and Successes
- Main Operational Problem : Manpower
- Main Technical Problem : Geometry Translation
- Further Problems with using NVIDIA OptiX
- Benefits from using NVIDIA OptiX
Overview + Links

JUNO Optical Photon Simulation Problem...

CPU vs GPU architectures, Latency vs Throughput

[Figure: CPU vs GPU architecture (cpu_vs_gpu_architecture.png)]

Waiting for memory reads/writes is a major source of latency...

CPU : latency-oriented : Minimize time to complete single task : avoid latency with caching

complex : caching system, branch prediction, speculative execution, ...

GPU : throughput-oriented : Maximize total work per unit time : hide latency with parallelism

many simple processing cores, hardware multithreading, SIMD (single instruction multiple data)
simpler : lots of compute (ALU), at expense of cache+control
design assumes abundant parallelism

Effective use of a totally different processor architecture -> total reorganization of data and computation

Understanding Throughput-oriented Architectures https://cacm.acm.org/magazines/2010/11/100622-understanding-throughput-oriented-architectures/fulltext

Understanding GPU Graphical Origins -> Effective GPU Computation

GPUs evolved to rasterize 3D graphics at 30/60 fps

30/60 "launches" per second, each handling millions of items
literally billions of small "shader" programs run per second

Simple Array Data Structures (N-million,4)

millions of vertices, millions of triangles
vertex: (x y z w)
colors: (r g b a)

Constant "Uniform" 4x4 matrices : scaling+rotation+translation

4-component homogeneous coordinates -> easy projection

Graphical Experience Informs Fast Computation on GPUs

array shapes similar to graphics ones are faster
- "float4" 4*float(32bit) = 128 bit memory reads are favored
- Opticks photons use "float4x4" just like 4x4 matrices
GPU Launch frequency < ~30/60 per second
- avoid copy+launch overheads becoming significant
- ideally : handle millions of items in each launch

Optical Photon Simulation ≈ Ray Traced Image Rendering

Much in common : geometry, light sources, optical physics

simulation : photon parameters at PMT detectors
rendering : pixel values at image plane
both limited by ray geometry intersection, aka ray tracing

Many Applications of ray tracing :

advertising, design, architecture, films, games,...
-> huge efforts to improve hw+sw over 30 yrs

Ray-tracing vs Rasterization

[Figures: rasterization (nv_rasterization.png), ray tracing (nv_raytrace.png)]

Turing Built for RTX, BVH : Bounding Volume Hierarchy

`Spatial Index Acceleration Structure`

NVIDIA® OptiX™ Ray Tracing Engine -- http://developer.nvidia.com/optix

OptiX makes GPU ray tracing accessible

accelerates ray-geometry intersections
simple : single-ray programming model
"...free to use within any application..."
access RT Cores[1] with OptiX 6.0.0+ via RTX™ mode

NVIDIA expertise:

compiler optimized for GPU ray tracing
~linear scaling up to 4 GPUs
acceleration structure creation + traversal (Blue)
instanced sharing of geometry + acceleration structures

Opticks provides (Yellow):

ray generation program
ray geometry intersection+bbox programs

[1] Turing RTX GPUs

Geant4 + Opticks Hybrid Workflow : External Optical Photon Simulation

Opticks : Translates G4 Optical Physics to CUDA/OptiX

OptiX : single-ray programming model -> line-by-line translation

CUDA Ports of Geant4 classes

G4Cerenkov (only generation loop)
G4Scintillation (only generation loop)
G4OpAbsorption
G4OpRayleigh
G4OpBoundaryProcess (only a few surface types)

Modify Cherenkov + Scintillation Processes

collect genstep, copy to GPU for generation
avoids copying millions of photons to GPU

Scintillator Reemission

fraction of bulk absorbed "reborn" within same thread
wavelength generated by reemission texture lookup
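
One common way to generate a wavelength from a table lookup is inverse-CDF sampling: tabulate the wavelength at equally spaced values of the cumulative emission probability, then look up a uniform random number with linear interpolation, much as a GPU texture would. The sketch below is a host-side illustration under that assumption; `build_icdf` and `lookup` are hypothetical names and the actual Opticks texture layout may differ.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Build icdf[k] = wavelength at which the normalized cumulative intensity
// reaches k/(n-1). Requires wl ascending, wl.size() >= 2 and n >= 2.
// The CDF is inverted by linear interpolation between grid points.
std::vector<float> build_icdf(const std::vector<float>& wl,
                              const std::vector<float>& intensity,
                              std::size_t n) {
    std::vector<double> cdf(wl.size(), 0.0);
    for (std::size_t i = 1; i < wl.size(); ++i)   // trapezoidal integration
        cdf[i] = cdf[i-1] + 0.5 * (intensity[i] + intensity[i-1]) * (wl[i] - wl[i-1]);
    const double total = cdf.back();

    std::vector<float> icdf(n);
    std::size_t j = 1;
    for (std::size_t k = 0; k < n; ++k) {
        const double target = total * double(k) / double(n - 1);
        while (j < wl.size() - 1 && cdf[j] < target) ++j;
        const double span = cdf[j] - cdf[j-1];
        const double f = span > 0.0 ? (target - cdf[j-1]) / span : 0.0;
        icdf[k] = float(wl[j-1] + f * (wl[j] - wl[j-1]));
    }
    return icdf;
}

// Emulate a linearly interpolated 1D texture lookup with u in [0,1].
float lookup(const std::vector<float>& icdf, float u) {
    const float x = u * float(icdf.size() - 1);
    const std::size_t k = static_cast<std::size_t>(x);
    if (k >= icdf.size() - 1) return icdf.back();
    const float f = x - float(k);
    return icdf[k] + f * (icdf[k+1] - icdf[k]);
}
```

On the GPU the table would live in a texture and `u` would come from the per-thread random number generator, so a reemitted photon gets its new wavelength from a single lookup.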

Opticks (OptiX/Thrust GPU interoperation)

OptiX : upload gensteps
Thrust : seeding, distribute genstep indices to photons
OptiX : launch photon generation and propagation
Thrust : pullback photons that hit PMTs
Thrust : index photon step sequences (optional)
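
The seeding step can be pictured with a small CPU sketch: each genstep carries a photon count, and every photon slot is given the index of the genstep it belongs to. The function name `seed_photons` is hypothetical and the serial loop only illustrates the result; the slide states that Opticks performs this step on the GPU with Thrust.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// For each genstep g with counts[g] photons, write g into that many
// consecutive photon slots. Gensteps with zero photons contribute nothing.
std::vector<unsigned> seed_photons(const std::vector<unsigned>& counts) {
    std::size_t total = 0;
    for (std::size_t g = 0; g < counts.size(); ++g) total += counts[g];

    std::vector<unsigned> seeds;
    seeds.reserve(total);
    for (std::size_t g = 0; g < counts.size(); ++g)
        seeds.insert(seeds.end(), static_cast<std::size_t>(counts[g]),
                     static_cast<unsigned>(g));
    return seeds;
}
```

Each GPU thread can then read its own seed to find the genstep parameters it should generate a photon from, so only the gensteps need uploading.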

G4Solid -> CUDA Intersect Functions for ~10 Primitives

3D parametric ray : ray(x,y,z;t) = rayOrigin + t * rayDirection
implicit equation of primitive : f(x,y,z) = 0
-> polynomial in t , roots: t > t_min -> intersection positions + surface normals

[Figure: parade of primitives (tboolean_parade_sep2017.png)]

Sphere, Cylinder, Disc, Cone, Convex Polyhedron, Hyperboloid, Torus, ...
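
For the simplest primitive, a sphere, substituting the parametric ray into the implicit equation gives a quadratic in t. The following host-side sketch picks the nearest root beyond t_min and returns the surface normal; the `Vec3` type and the function names are illustrative, not the Opticks implementation.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

inline float dot(const Vec3& a, const Vec3& b) { return a.x*b.x + a.y*b.y + a.z*b.z; }
inline Vec3 sub(const Vec3& a, const Vec3& b) { return Vec3{a.x-b.x, a.y-b.y, a.z-b.z}; }

// Sphere: f(p) = |p - center|^2 - radius^2 = 0 with p = origin + t*direction
// gives a*t^2 + 2*b*t + c = 0. Returns true and sets t and the outward unit
// normal for the nearest root with t > t_min, otherwise returns false.
bool intersect_sphere(const Vec3& origin, const Vec3& direction,
                      const Vec3& center, float radius, float t_min,
                      float& t, Vec3& normal) {
    const Vec3 oc = sub(origin, center);
    const float a = dot(direction, direction);
    const float b = dot(oc, direction);
    const float c = dot(oc, oc) - radius*radius;
    const float disc = b*b - a*c;
    if (disc < 0.f) return false;            // ray misses the sphere

    const float s = std::sqrt(disc);
    const float t1 = (-b - s) / a;           // near root
    const float t2 = (-b + s) / a;           // far root
    if (t1 > t_min)      t = t1;
    else if (t2 > t_min) t = t2;
    else return false;                       // both roots behind t_min

    normal = Vec3{ (oc.x + t*direction.x) / radius,
                   (oc.y + t*direction.y) / radius,
                   (oc.z + t*direction.z) / radius };
    return true;
}
```

Raising t_min past the first root yields the second one, which is the mechanism the CSG "Loop" action relies on when it advances t_min and intersects again.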

G4Boolean -> CUDA/OptiX Intersection Program Implementing CSG

Complete Binary Tree, pick between pairs of nearest intersects:

UNION tA < tB    Enter B    Exit B     Miss B
Enter A          ReturnA    LoopA      ReturnA
Exit A           ReturnA    ReturnB    ReturnA
Miss A           ReturnB    ReturnB    ReturnMiss

Nearest hit intersect algorithm [1] avoids state
- sometimes Loop : advance t_min , re-intersect both
- classification shows if inside/outside
Evaluative [2] implementation emulates recursion:
- recursion not allowed in OptiX intersect programs
- bit twiddle traversal of complete binary tree
- stacks of postorder slices and intersects
Identical geometry to Geant4
- solving the same polynomials
- near perfect intersection match

[1] Ray Tracing CSG Objects Using Single Hit Intersections, Andrew Kensler (2006): with corrections by author of XRT Raytracer http://xrt.wikidot.com/doc:csg
[2] https://bitbucket.org/simoncblyth/opticks/src/master/optixrap/cu/csg_intersect_boolean.h: Similar to binary expression tree evaluation using postorder traverse.
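
The UNION table above maps directly onto a small lookup indexed by the classification of the two nearest intersects. This sketch encodes only the tA < tB case shown on the slide; the enum and function names are illustrative rather than those used in csg_intersect_boolean.h.

```cpp
#include <cassert>

// Classification of the nearest intersect with each constituent.
enum State { Enter = 0, Exit = 1, Miss = 2 };

// What the boolean combination does next.
enum Action { ReturnA, ReturnB, ReturnMiss, LoopA };

// UNION with tA < tB : rows are the state of A, columns the state of B,
// copied from the table on the slide.
inline Action union_action_a_closer(State a, State b) {
    static const Action table[3][3] = {
        /* Enter A */ { ReturnA, LoopA,   ReturnA    },
        /* Exit A  */ { ReturnA, ReturnB, ReturnA    },
        /* Miss A  */ { ReturnB, ReturnB, ReturnMiss },
    };
    return table[a][b];
}
```

LoopA means the candidate from A is not on the surface of the union, so t_min is advanced past it and the intersection is repeated. Keeping the decision in a table avoids per-ray state beyond the two candidate intersects.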

CSG Complete Binary Tree Serialization -> simplifies GPU side

Geant4 solid -> CSG binary tree (leaf primitives, non-leaf operators, 4x4 transforms on any node)

Serialize to complete binary tree buffer:

no need to deserialize, no child/parent pointers
bit twiddling navigation avoids recursion
simple approach profits from small size of binary trees
BUT: very inefficient when unbalanced

Height 3 complete binary tree with level order indices:

                                                   depth     elevation

                     1                               0           3

          10                   11                    1           2

     100       101        110        111             2           1

 1000 1001  1010 1011  1100 1101  1110  1111         3           0

postorder_next(i,elevation) = i & 1 ? i >> 1 : (i << elevation) + (1 << elevation) ; // from pattern of bits

Postorder tree traverse visits all nodes, starting from leftmost, such that children are visited prior to their parents.
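
The bit pattern can be checked with a short host-side sketch that walks a complete binary tree in postorder using only the 1-based level order index and the node elevation. The helpers `depth_of` and `postorder_sequence` are names introduced here for illustration.

```cpp
#include <cassert>
#include <vector>

// Depth of node i (1-based level order index) = position of its highest set bit.
inline unsigned depth_of(unsigned i) {
    unsigned d = 0;
    while (i >>= 1) ++d;
    return d;
}

// Right child (odd index)  -> parent.
// Left child (even index)  -> leftmost leaf of the right sibling's subtree.
// The root (index 1) yields 0, which terminates the traverse.
inline unsigned postorder_next(unsigned i, unsigned elevation) {
    return (i & 1u) ? (i >> 1) : ((i << elevation) + (1u << elevation));
}

// Visit order for a complete binary tree of the given height.
std::vector<unsigned> postorder_sequence(unsigned height) {
    std::vector<unsigned> seq;
    unsigned i = 1u << height;                 // start from the leftmost leaf
    while (i != 0) {
        seq.push_back(i);
        const unsigned elevation = height - depth_of(i);
        i = postorder_next(i, elevation);
    }
    return seq;
}
```

For the height 3 tree drawn above this yields 8, 9, 4, 10, 11, 5, 2, 12, 13, 6, 14, 15, 7, 3, 1, that is 1000, 1001, 100, ... in the binary labels of the diagram, with no recursion and no child/parent pointers.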

Opticks : Translates G4 Geometry to GPU, Without Approximation

Material/Surface/Scintillator properties

interpolated to standard wavelength domain
interleaved into "boundary" texture
"reemission" texture for wavelength generation

Material/surface boundary : 4 indices

outer material (parent)
outer surface (inward photons, parent -> self)
inner surface (outward photons, self -> parent)
inner material (self)

Primitives labelled with unique boundary index

ray primitive intersection -> boundary index
texture lookup -> material/surface properties

simple/fast properties + reemission wavelength
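
A minimal sketch of how the four boundary indices might be used once a ray hits a primitive: the sign of the photon direction relative to the surface normal selects which surface applies and which material the photon moves from and to. The struct layout, the `cross` helper and the outward-pointing normal convention are assumptions for illustration.

```cpp
#include <cassert>

// The four indices of one boundary, in the order listed on the slide.
struct Boundary {
    unsigned omat;  // outer material (parent)
    unsigned osur;  // outer surface  (inward photons, parent -> self)
    unsigned isur;  // inner surface  (outward photons, self -> parent)
    unsigned imat;  // inner material (self)
};

struct Crossing {
    unsigned from_material;
    unsigned surface;
    unsigned to_material;
};

// Assumes the normal points outward, from self toward parent, so a positive
// dot product of direction and normal means the photon is leaving the volume.
inline Crossing cross(const Boundary& b, float dir_dot_outward_normal) {
    const bool outward = dir_dot_outward_normal > 0.f;
    return outward ? Crossing{b.imat, b.isur, b.omat}
                   : Crossing{b.omat, b.osur, b.imat};
}
```

The returned indices would then select rows of the "boundary" texture for the material and surface property lookups.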

G4 Structure Tree -> Instance+Global Arrays -> OptiX

Group structure into repeated instances + global remainder:

auto-identify repeated geometry with "progeny digests"
- JUNO : 5 distinct instances + 1 global
instance transforms used in OptiX/OpenGL geometry

instancing -> huge memory savings for JUNO PMTs
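
The idea behind identifying repeated geometry can be sketched as follows: compute a digest for every volume from its own solid and the digests of its progeny, then count how often each digest occurs. Subtrees with a high count are candidates for instancing. The `Volume` type and the readable string digest are assumptions for illustration; a real implementation would hash the content and would also have to consider transforms.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a node of the Geant4 structure tree.
struct Volume {
    std::string solid;
    std::vector<Volume> children;
};

// Returns the digest of the subtree rooted at v and tallies every digest seen.
std::string progeny_digest(const Volume& v, std::map<std::string, int>& counts) {
    std::string digest = v.solid + "(";
    for (std::size_t i = 0; i < v.children.size(); ++i)
        digest += progeny_digest(v.children[i], counts) + ",";
    digest += ")";
    ++counts[digest];
    return digest;
}
```

A digest that occurs thousands of times, such as a PMT assembly, needs its geometry and acceleration structure stored only once, with one transform per placement.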

j1808_top_rtx

j1808_top_ogl

Validation of Opticks Simulation by Comparison with Geant4

Bi-simulations of all JUNO solids, with millions of photons

mis-aligned histories: mostly < 0.25%, < 0.50% for largest solids
deviant photons within matched history: < 0.05% (500/1M)

Primary sources of problems

grazing incidence, edge skimmers
incidence at constituent solid boundaries

Primary cause : float vs double

Geant4 uses double everywhere, Opticks only sparingly (double precision observed to cost a 10x slowdown with RTX)

Conclude

neatly oriented photons more prone to issues than realistic ones
perfect "technical" matching not feasible
instead shift validation to more realistic full detector "calibration" situation

scan-pf-check-GUI-TO-SC-BT5-SD

scan-pf-check-GUI-TO-BT5-SD

Performance : Scanning from 1M to 400M Photons

Full JUNO Analytic Geometry j1808v5

"calibration source" genstep at center of scintillator

Production Mode : does the minimum

only saves hits
skips : genstep, photon, source, record, sequence, index, ..
no Geant4 propagation (other than at 1M for extrapolation)

Multi-Event Running, Measure:

interval: avg time between successive launches, including overheads: (upload gensteps + launch + download hits)
launch: avg of 10 OptiX launches

overheads < 10% beyond 20M photons

`NVIDIA Quadro RTX 8000 (48G)`

Thanks to NVIDIA China
for loaning the card

scan-pf-1_NHit

scan-pf-1_Opticks_vs_Geant4

JUNO analytic, 400M photons from center    Time                 Speedup
Geant4 Extrap.                             95,600 s (26 hrs)
Opticks RTX ON (i)                         58 s                 1650x

scan-pf-1_Opticks_Speedup

JUNO analytic, 400M photons from center    Time                 Speedup
Opticks RTX ON (i)                         58 s                 1650x
Opticks RTX OFF (i)                        275 s                350x
Geant4 Extrap.                             95,600 s (26 hrs)

scan-pf-1_RTX_Speedup

5x Speedup from RTX with JUNO analytic geometry

Opticks Experience : Main Operational Problem : Manpower

Lots of interest, very little contribution, why ?

Tool Innovation is Disincentivized ?

students/postdocs interested in Opticks
advisors steer them to analysis : less risky, better for career

Why is GPU simulation development difficult ?

totally different geometry model
- tree of C++ objects -> arrays, textures
- solid primitives -> intersection by solving polynomials
totally different development model
- complex libraries -> simple headers
- simpler -> smaller stack -> more threads in flight
- low level CUDA development, eg CSG from first principles
- very few libs
- restricted CUDA environment
  - no recursion in intersect
  - no shared memory/synchronizations/barriers
  - double precision problematic, performance hit

Opticks Experience : Main Technical Problem : Geometry Translation

Intersection Performance -> Simulation Performance, Drivers:

acceleration structure (AS) eg BVH
geometry model input to AS

Analytic Geometry : translate volume -> surface based model

Coincident faces (even in CSG boolean constituents)

very common problem, causes spurious intersects
- manual modelling changes : avoiding coincidence
CSG serialized using complete binary tree
- simple+convenient, very inefficient for unbalanced trees
- balancing enables support for more complex trees
- v.complicated solids (G4Boolean abuse) still problematic

Analytic Torus Intersection

double precision quartic root solving
very large coefficient range, robust solution difficult
- many techniques tried (numerical/computer science papers)
very heavy kernel : 10x performance impact even when unused
pragmatic solution : avoid torus, or use triangulated

Opticks Experience : Problems with using NVIDIA OptiX

NVIDIA GPUs only
No influence on direction of NVIDIA OptiX (eg OptiX 7)
immature OptiX support for Linux debug/profiling
- most OptiX users develop on Windows ?
- cuda-gdb not to level of Nsight VSE

Optimization Issues

difficult to approach 10 GigaRays/s
closed source : black box BVH
- blind experimentation

Linux GPU Cluster (eg Tesla V100) Deployment Issues

OptiX releases demand latest short lived branch driver
GPU clusters typically use long lived branch drivers
- drivers appropriate for OptiX slow to appear
- must use older OptiX release for many months
RTX performance, RT cores ~not available in server GPUs[1,2]
- 4 non-RTX GPUs ~ single RTX GPU performance
https://www.nvidia.com/en-us/design-visualization/quadro-data-center/

[1] NVIDIA RTX Server with 8x NVIDIA Quadro RTX 8000 : probably restricted to car, design, film companies ...
[2] NVIDIA Quadro RTX 8000 PCIe Server Card (Passive)

Opticks Experience : Benefits from using NVIDIA OptiX

NVIDIA OptiX 3,4,5,6

excellent easy to use API
useful shared host/device context
flexible geometry : implement intersection code
automated acceleration structure (AS) building/traversal
instancing of geometry + AS, essential for JUNO PMTs
transparent ~linear scaling up to 4 GPUs
graphics interop buffer sharing CUDA/Thrust/OptiX/OpenGL
- in-situ visualization with no data movement
straightforward port of Geant4 optical physics

NVIDIA OptiX 6

accelerated with ray trace dedicated hardware (RT Cores)
- 5x performance from RTX

1 or 2 Releases per Year

until OptiX 7, fairly compatible API changes
tuning for new GPUs

Summary

Opticks : state-of-the-art GPU ray tracing applied to optical photon simulation and integrated with Geant4, giving a leap in performance that eliminates memory and time bottlenecks.

Drastic speedup -> better detector understanding -> greater precision

any simulation limited by optical photons can benefit

more photon limited -> more overall speedup (99% -> 100x)
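
The "99% -> 100x" figure follows from Amdahl's law: if a fraction f of the run time is optical photon propagation and that part is sped up by a factor s, the overall speedup is 1 / ((1 - f) + f / s). A small sketch, with the function name chosen here for illustration:

```cpp
#include <cassert>
#include <cmath>

// Amdahl's law: overall speedup when a fraction of the work is accelerated.
inline double overall_speedup(double optical_fraction, double optical_speedup) {
    return 1.0 / ((1.0 - optical_fraction) + optical_fraction / optical_speedup);
}
```

With f = 0.99 and the 1650x optical speedup quoted earlier this gives about 94x, approaching the 100x limit set by the remaining 1% of non-optical work. With f = 0.5 the overall gain stays just under 2x however fast the photon propagation becomes.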

https://bitbucket.org/simoncblyth/opticks	code repository
https://simoncblyth.bitbucket.io	presentations and videos
https://groups.io/g/opticks	forum/mailing list archive
email:opticks+subscribe@groups.io	subscribe to mailing list