Huge CPU Memory+Time Expense
Realtime RTX render, 1 Ampere GPU
https://www.youtube.com/watch?v=NgcYLIvlp_k
NVIDIA GeForce RTX 3090 [USD 1499]
2560*1440 = 3.7M pixels -> x30 -> 110M pixels/s
DLSS : Deep Learning Super Sampling
Interfaces over NVIDIA Driver
Driver Updates : Independant of Application
Three Similar Interfaces over same RTX tech:
NVIDIA OptiX (Linux, Windows) [2009]
Vulkan RT (Linux, Windows) [final spec 2020]
Microsoft DXR : DirectX 12 Ray Tracing (Windows) [2018]
Metal Ray Tracing API (macOS) [introduced 2020[1]]
[1] https://developer.apple.com/videos/play/wwdc2020/10012/
Acceleration Structure (AS) traversal is central to pipeline performance
RG : Ray Generation
IS : Intersect
CH : Closest Hit
AH : Any Hit
MS : Miss
GPU Opticks
Tree of Bounding Boxes (bbox)
OptiX supports multiple instance levels : IAS->IAS->GAS BUT: Simple two-level is faster : works in hardware RT Cores
SBT : Shader Binding Table
Flexibly binds together:
Hidden in OptiX 1-6 APIs
NPY Serialization Benefits
a = np.load("transforms.npy")
Arrays for Everything -> direct access debug
read/write/stream NumPy arrays from C++
Array-oriented : separate data from compute
Opticks/NPY pkg : Array Interface Using glm::mat4 glm::vec4
Opticks/GGeo classes implemented with NPY arrays
[1] http://www.numpy.org/neps/nep-0001-npy-format.html
Load NumPy array into C/C++
Straightforward to parse NPY files
http://github.com/simoncblyth/np/
NP::Load
// gcc NPMinimal.cc -lstdc++ && ./a.out /tmp/a.npy #include "NP.hh" int main(int argc, char** argv) { assert( argc > 1 && argv[1] ) ; NP* a = NP ::Load(argv[1]) ; a->dump(); return 0 ; }
IPython persisting NumPy arrays:
In [1]: a = np.arange(10) # array of 10 long (int64) In [2]: a Out[2]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) In [3]: np.save("/tmp/a.npy", a ) # serialize array to file In [4]: a2 = np.load("/tmp/a.npy") # load array into memory In [5]: assert np.all( a == a2 ) # check all elements the same
IPython examine NPY file format:
In [6]: !xxd /tmp/a.npy # xxd hexdump file contents
# minimal header : type and shape
00000000: 934e 554d 5059 0100 7600 7b27 6465 7363 .NUMPY..v.{'desc
00000010: 7227 3a20 273c 6938 272c 2027 666f 7274 r': '<i8', 'fort
00000020: 7261 6e5f 6f72 6465 7227 3a20 4661 6c73 ran_order': Fals
00000030: 652c 2027 7368 6170 6527 3a20 2831 302c e, 'shape': (10,
00000040: 292c 207d 2020 2020 2020 2020 2020 2020 ), }
00000050: 2020 2020 2020 2020 2020 2020 2020 2020
00000060: 2020 2020 2020 2020 2020 2020 2020 2020
00000070: 2020 2020 2020 2020 2020 2020 2020 200a .
00000080: 0000 0000 0000 0000 0100 0000 0000 0000 ................
00000090: 0200 0000 0000 0000 0300 0000 0000 0000 ................
..
Geant4 -> Opticks/GGeo -> OptiX
Multi-stage translation
GGeo, GVolume, GParts, GMesh, GMaterial, ...
OGeo, OGeometry, OBndLib, ...
Structural volumes : G4PVPlacement ->
Solid shapes : G4VSolid ->
Material/surface properties as function of wavelength
Translation steered by X4 package
https://bitbucket.org/simoncblyth/opticks/src/master/extg4/X4PhysicalVolume.hh
Form of GPU Detector Geometry
JUNO: ~300,000 GVolume -> ~10 GMergedMesh
JUNO: ~300,000 GVolume : mostly small repeated groups (PMTs)
GGeo/GInstancer
For each repeat+remainder create GMergedMesh:
GMergedMesh -> IAS+GAS
https://bitbucket.org/simoncblyth/opticks/src/master/ggeo/GInstancer.hh
Optimization : deciding where to draw lines between:
Where those lines are drawn defines the AS
https://developer.nvidia.com/blog/best-practices-using-nvidia-rtx-ray-tracing/
https://developer.nvidia.com/blog/best-practices-using-nvidia-rtx-ray-tracing/
Advantages apply equally to acceleration structures
Equivalent Intersects -> same t
Local Frame Advantages
Geometry Instancing Advantages
Requirements
OpenGL Mesh Instancing
OptiX Ray Traced Instancing
Shapes from buffer content:
Populated from GMergedMesh
OptiXRap/cu/intersect_analytic.cu
425 RT_PROGRAM void intersect(int primIdx) 426 { 427 const Prim& prim = primBuffer[primIdx]; 428 429 unsigned partOffset = prim.partOffset() ; 430 unsigned numParts = prim.numParts() ; 431 unsigned primFlag = prim.primFlag() ; 432 433 if(primFlag == CSG_FLAGNODETREE) 434 { 435 evaluative_csg( prim, primIdx ); 436 } 474 }
229 RT_PROGRAM void bounds (int primIdx, float result[6]) 230 { 251 optix::Aabb* aabb = (optix::Aabb*)result; 252 *aabb = optix::Aabb(); 253 254 uint4 identity = identityBuffer[instance_index*primitive_count+primIdx] ; ... 271 const Prim& prim = primBuffer[primIdx]; ... 294 if(primFlag == CSG_FLAGNODETREE || primFlag == CSG_FLAGINVISIBLE ) 295 { 301 csg_bounds_prim(primIdx, prim, aabb); 318 } 385 }
https://bitbucket.org/simoncblyth/opticks/src/master/optixrap/cu/intersect_analytic.cu
Outside/Inside Unions
dot(normal,rayDir) -> Enter/Exit
Complete Binary Tree, pick between pairs of nearest intersects:
UNION tA < tB | Enter B | Exit B | Miss B |
---|---|---|---|
Enter A | ReturnA | LoopA | ReturnA |
Exit A | ReturnA | ReturnB | ReturnA |
Miss A | ReturnB | ReturnB | ReturnMiss |
Materials/Surfaces -> GPU Texture
Material/Surface/Scintillator properties
Material/surface boundary : 4 indices
Primitives labelled with unique boundary index
simple/fast properties + reemission wavelength
G4 Structure Tree -> Instance+Global Arrays -> OptiX
Group structure into repeated instances + global remainder:
instancing -> huge memory savings for JUNO PMTs
JUNO analytic, 400M photons from center | Speedup | |
---|---|---|
Geant4 Extrap. | 95,600 s (26 hrs) | |
Opticks RTX ON (i) | 58 s | 1650x |
100M photon RTX times, avg of 10
Launch times for various geometries | |||
---|---|---|---|
Geometry | Launch (s) | Giga Rays/s | Relative to ana |
JUNO ana | 13.2 | 0.07 | |
JUNO tri.sw | 6.9 | 0.14 | 1.9x |
JUNO tri.hw | 2.2 | 0.45 | 6.0x |
Boxtest ana | 0.59 | 1.7 | |
Boxtest tri.sw | 0.62 | 1.6 | |
Boxtest tri.hw | 0.30 | 3.3 | 1.9x |
JUNO 15k triangles, 132M without instancing
Simple Boxtest geometry gets into ballpark
OptiX Performance Tools and Tricks, David Hart, NVIDIA https://developer.nvidia.com/siggraph/2019/video/sig915-vid
GPU : Geometry Rethink Mandatory
Opticks > 1500x Geant4 (one Turing GPU)
Opticks : state-of-the-art GPU ray tracing applied to optical photon simulation and integrated with Geant4, giving a leap in performance that eliminates memory and time bottlenecks.
- Drastic speedup -> better detector understanding -> greater precision
- any simulation limited by optical photons can benefit
- more photon limited -> more overall speedup (99% -> 100x)
https://bitbucket.org/simoncblyth/opticks | code repository |
https://simoncblyth.bitbucket.io | presentations and videos |
https://groups.io/g/opticks | forum/mailing list archive |
email:opticks+subscribe@groups.io | subscribe to mailing list |
CSG Binary Tree
Primitives combined via binary operators
Simple by construction definition, implicit geometry.
CSG expressions
3D Parametric Ray : ray(t) = r0 + t rDir
Ray Geometry Intersection
How to pick exactly ?
In/On/Out transitions
Classical Roth diagram approach
Computational requirements:
BUT : High performance on GPU requires:
Classical approach not appropriate on GPU
Bit Twiddling Navigation
Geant4 solid -> CSG binary tree (leaf primitives, non-leaf operators, 4x4 transforms on any node)
Serialize to complete binary tree buffer:
Height 3 complete binary tree with level order indices:
depth elevation 1 0 3 10 11 1 2 100 101 110 111 2 1 1000 1001 1010 1011 1100 1101 1110 1111 3 0
postorder_next(i,elevation) = i & 1 ? i >> 1 : (i << elevation) + (1 << elevation) ; // from pattern of bits
Postorder tree traverse visits all nodes, starting from leftmost, such that children are visited prior to their parents.
fullTree = PACK( 1 << height, 1 >> 1 ) // leftmost, parent_of_root(=0) tranche.push(fullTree, ray.tmin) while (!tranche.empty) // stack of begin/end indices { begin, end, tmin <- tranche.pop ; node <- begin ; while( node != end ) // over tranche of postorder traversal { elevation = height - TREE_DEPTH(node) ; if(is_primitive(node)){ isect <- intersect_primitive(node, tmin) ; csg.push(isect) } else{ i_left, i_right = csg.pop, csg.pop // csg stack of intersect normals, t l_state = CLASSIFY(i_left, ray.direction, tmin) r_state = CLASSIFY(i_right, ray.direction, tmin) action = LUT(operator(node), leftIsCloser)(l_state, r_state) if( action is ReturnLeft/Right) csg.push(i_left or i_right) else if( action is LoopLeft/Right) { left = 2*node ; right = 2*node + 1 ; endTranche = PACK( node, end ); leftTranche = PACK( left << (elevation-1), right << (elevation-1) ) rightTranche = PACK( right << (elevation-1), node ) loopTranche = action ? leftTranche : rightTranche tranche.push(endTranche, tmin) tranche.push(loopTranche, tminAdvanced ) // subtree re-traversal with changed tmin break ; // to next tranche } } node <- postorder_next(node, elevation) // bit twiddling postorder } } isect = csg.pop(); // winning intersect
https://bitbucket.org/simoncblyth/opticks/src/tip/optixrap/cu/csg_intersect_boolean.h
NTreeAnalyse height 11 count 25 ( un : union, cy : cylinder, di : difference ) un un di un cy cy cy un cy un cy un cy un cy un cy un cy un cy di cy cy cy
CSG trees are non-unique
Positive form CSG Trees
Apply deMorgan pushing negations down tree
End with only UNION, INTERSECT operators, and some complemented leaves.
COMMUTATIVE -> easily rearranged
1st step to allow balancing : Positivize : remove CSG difference di operators
... ... un cy un cy un cy un cy un cy di cy cy cy
... ... un cy un cy un cy un cy un cy in cy cy !cy
NTreeAnalyse height 4 count 25 un un un un un un in un un un un cy in cy !cy cy cy cy cy cy cy cy cy cy !cy
un : union, in : intersect, cy : cylinder, !cy : complemented cylinder
Balancing positive tree:
Not a general balancer : but succeeds with all CSG solid trees from Daya Bay and JUNO so far
https://bitbucket.org/simoncblyth/opticks/src/default/npy/NTreeBalance.cpp
Torus artifacts
3D parametric ray : ray(x,y,z;t) = rayOrigin + t * rayDirection
High order equation
Best Solution : replace torus