Standard Simulation Tool of HEP
Geant4 simulates particles travelling through matter
Geant4 Approach
Very General and Capable Tool
Huge CPU Memory+Time Expense
Not a Photo, a Calculation
Much in common : geometry, light sources, optical physics
Many Applications of ray tracing :
10 Giga Rays/s
Offload Ray Trace to Dedicated HW
SM : Streaming Multiprocessor
BVH : Bounding Volume Hierarchy
RTX Platform : Hybrid Rendering
-> real-time photoreal cinematic 3D rendering
Tree of Bounding Boxes (bbox)
OptiX Raytracing Pipeline
Analogous to OpenGL rasterization pipeline:
OptiX makes GPU ray tracing accessible
NVIDIA expertise:
Opticks provides (Yellow):
[1] Turing RTX GPUs
GPU Resident Photons
Thrust: high level C++ access to CUDA
OptiX : single-ray programming model -> line-by-line translation
Outside/Inside Unions
dot(normal,rayDir) -> Enter/Exit
Complete Binary Tree, pick between pairs of nearest intersects:
UNION tA < tB | Enter B | Exit B | Miss B |
---|---|---|---|
Enter A | ReturnA | LoopA | ReturnA |
Exit A | ReturnA | ReturnB | ReturnA |
Miss A | ReturnB | ReturnB | ReturnMiss |
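The same table can be expressed directly as a lookup; a Python sketch of the idea (the state names, codes and helper are illustrative, not the Opticks CUDA implementation):

    # Illustrative sketch of the UNION decision table above (not the Opticks CUDA code).
    ENTER, EXIT, MISS = "Enter", "Exit", "Miss"

    UNION_TABLE = {   # (state of nearer hit A, state of farther hit B) -> action
        (ENTER, ENTER): "ReturnA", (ENTER, EXIT): "LoopA",   (ENTER, MISS): "ReturnA",
        (EXIT,  ENTER): "ReturnA", (EXIT,  EXIT): "ReturnB", (EXIT,  MISS): "ReturnA",
        (MISS,  ENTER): "ReturnB", (MISS,  EXIT): "ReturnB", (MISS,  MISS): "ReturnMiss",
    }

    def classify(t, cos_theta):
        """Enter/Exit from the sign of dot(normal, rayDir); Miss when there is no intersect."""
        if t is None: return MISS
        return ENTER if cos_theta < 0.0 else EXIT

    tA, cosA = 5.0, -0.3     # nearer hit : ray entering A
    tB, cosB = 9.0,  0.4     # farther hit : ray exiting B
    print(UNION_TABLE[(classify(tA, cosA), classify(tB, cosB))])  # LoopA : advance tmin past A, intersect A again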
Materials/Surfaces -> GPU Texture
Material/Surface/Scintillator properties
Material/surface boundary : 4 indices
Primitives labelled with unique boundary index
simple/fast properties + reemission wavelength
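A NumPy sketch of the property-texture idea; the array shape, the meaning of the four indices and the property names are assumptions for illustration, not the exact Opticks layout:

    # Sketch : wavelength-dependent properties interleaved into one float32 array that
    # becomes a GPU texture, so a boundary index plus a wavelength bin fetches everything
    # needed at an intersection. Sizes and names below are illustrative assumptions.
    import numpy as np

    NUM_BOUNDARY, NUM_IDX, NUM_WL, NUM_PROP = 50, 4, 39, 4
    # NUM_IDX = 4  : e.g. outer-material / outer-surface / inner-surface / inner-material
    # NUM_PROP = 4 : e.g. refractive_index, absorption_length, scattering_length, reemission_prob

    boundary_tex = np.zeros((NUM_BOUNDARY, NUM_IDX, NUM_WL, NUM_PROP), dtype=np.float32)

    def lookup(boundary, idx, wavelength_nm, wl_lo=60.0, wl_hi=820.0):
        """Nearest-bin fetch; on the GPU this is a hardware-interpolated texture read."""
        b = int(round((wavelength_nm - wl_lo) / (wl_hi - wl_lo) * (NUM_WL - 1)))
        return boundary_tex[boundary, idx, np.clip(b, 0, NUM_WL - 1)]

    props = lookup(boundary=7, idx=0, wavelength_nm=430.0)   # properties of the outer material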
G4 Structure Tree -> Instance+Global Arrays -> OptiX
Group structure into repeated instances + global remainder:
instancing -> huge memory savings for JUNO PMTs
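A rough sketch of the memory saving from instancing; the PMT and triangle counts are illustrative placeholders, not the actual JUNO numbers:

    # Store each repeated PMT mesh once plus a 4x4 transform per placement,
    # instead of flattening every placement into the global geometry.
    n_pmt         = 20_000       # assumed number of identical PMTs
    tris_per_pmt  = 5_000        # assumed triangles per PMT mesh
    bytes_per_tri = 3 * 3 * 4    # 3 vertices x 3 float32 coords

    flattened = n_pmt * tris_per_pmt * bytes_per_tri            # every placement duplicated
    instanced = tris_per_pmt * bytes_per_tri + n_pmt * 16 * 4   # one mesh + one 4x4 float32 transform each

    print(f"flattened {flattened/1e9:.1f} GB  vs  instanced {instanced/1e6:.1f} MB")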
Random Aligned Bi-Simulation
Same inputs to Opticks and Geant4:
Common recording into OpticksEvents:
Aligned random consumption, direct comparison:
Bi-simulations of all JUNO solids, with millions of photons
Primary sources of problems
Primary cause : float vs double
Geant4 uses double precision everywhere; Opticks uses it only sparingly (double precision observed to cost a ~10x slowdown with RTX)
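A tiny NumPy illustration of why float/double differences accumulate (not Opticks code, just the numerical effect behind occasional history deviations):

    import numpy as np

    rng = np.random.default_rng(42)
    dirs = rng.normal(size=(1000, 3))
    dirs /= np.linalg.norm(dirs, axis=1)[:, None]       # unit direction vectors

    pos32 = np.zeros(3, dtype=np.float32)
    pos64 = np.zeros(3, dtype=np.float64)
    for d in dirs:                                      # accumulate many small steps
        pos32 += (0.1 * d).astype(np.float32)
        pos64 += 0.1 * d

    print(np.abs(pos32 - pos64).max())   # ~1e-5 level drift : enough to flip a borderline intersect/decision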
Conclude
Compression Essential
Domain compression to fit in VRAM
4-bit History Flags at Each Step
BT : boundary transmit, BR : boundary reflect, SC : bulk scatter, AB : bulk absorb, SD : surface detect, SA : surface absorb
Up to 16 steps of the photon propagation are recorded.
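A sketch of the packing arithmetic: 16 steps of 4-bit flags fit exactly into one 64-bit history integer (the numeric flag codes below are placeholders, not the actual Opticks values):

    import numpy as np

    FLAG = {"BT": 0x1, "BR": 0x2, "SC": 0x3, "AB": 0x4, "SD": 0x5, "SA": 0x6}   # placeholder codes

    def pack_history(steps):
        """steps: list of up to 16 two-letter flag codes -> np.uint64 history sequence."""
        seq = np.uint64(0)
        for i, s in enumerate(steps[:16]):
            seq |= np.uint64(FLAG[s]) << np.uint64(4 * i)   # 4 bits per step
        return seq

    seqhis = pack_history(["BT", "BT", "SC", "BT", "SD"])
    print(hex(int(seqhis)))   # 0x51311 : one nibble per step, first step in the lowest nibble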
Photon Array : 4 * float4 = 512 bits/photon
Step Record Array : 2 * short4 = 2*16*4 = 128 bits/record
Compression uses known domains of position (geometry center, extent), time (0:200ns), wavelength, polarization.
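A NumPy sketch of such domain compression; the geometry extent and the signed/unsigned choices are illustrative assumptions:

    # Map positions and times onto known domains and store 16-bit integers instead of floats.
    import numpy as np

    center, extent = np.array([0., 0., 0.]), 20000.0   # mm, assumed geometry center/extent
    t_lo, t_hi     = 0.0, 200.0                        # ns, time domain from the slide

    def compress(pos, t):
        cpos = np.int16(np.clip((pos - center) / extent, -1, 1) * 32767)      # 3 x 16 bits
        ct   = np.uint16(np.clip((t - t_lo) / (t_hi - t_lo), 0, 1) * 65535)   # 16 bits
        return cpos, ct

    def decompress(cpos, ct):
        return cpos / 32767.0 * extent + center, ct / 65535.0 * (t_hi - t_lo) + t_lo

    cpos, ct = compress(np.array([1234.5, -6789.0, 42.0]), 87.3)
    print(decompress(cpos, ct))   # recovered to ~0.3 mm and ~0.002 ns resolution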
Test Hardware + Software
Workstation
Software
IHEP GPU Cluster
Full JUNO Analytic Geometry j1808v5
Production Mode : does the minimum
Multi-Event Running, Measure:
Photon Launch Size : VRAM Limited
NVIDIA Quadro RTX 8000 (48 GB)
400M photons x 112 bytes ~ 45G
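As a quick check of that budget:

    bytes_per_photon = 112                      # photon + step records, per the slide
    vram = 48e9                                 # Quadro RTX 8000
    print(400e6 * bytes_per_photon / 1e9)       # ~44.8 GB for 400M photons
    print(int(vram / bytes_per_photon / 1e6))   # ~428 M photons max per launch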
JUNO analytic, 400M photons from center | Time | Speedup |
---|---|---|
Geant4 Extrap. | 95,600 s (26 hrs) | |
Opticks RTX ON (i) | 58 s | 1650x |
JUNO analytic, 400M photons from center | Time | Speedup |
---|---|---|
Opticks RTX ON (i) | 58 s | 1650x |
Opticks RTX OFF (i) | 275 s | 350x |
Geant4 Extrap. | 95,600 s (26 hrs) | |

5x Speedup from RTX with JUNO analytic geometry
100M photon RTX times, avg of 10
Launch times for various geometries

Geometry | Launch (s) | Giga Rays/s | Relative to ana |
---|---|---|---|
JUNO ana | 13.2 | 0.07 | |
JUNO tri.sw | 6.9 | 0.14 | 1.9x |
JUNO tri.hw | 2.2 | 0.45 | 6.0x |
Boxtest ana | 0.59 | 1.7 | |
Boxtest tri.sw | 0.62 | 1.6 | |
Boxtest tri.hw | 0.30 | 3.3 | 1.9x |
JUNO 15k triangles, 132M without instancing
Simple Boxtest geometry gets into ballpark
OptiX Performance Tools and Tricks, David Hart, NVIDIA https://developer.nvidia.com/siggraph/2019/video/sig915-vid
NVIDIA OptiX 7 : Entirely new API
JUNO+Opticks into Production
Geant4+Opticks Integration : Work with Geant4 Collaboration
Alpha Development ------>-----------------> Robust Tool
How is >1000x possible ?
Progress over 30 yrs, Billions of Dollars
Photon Simulation ideally suited to GPU
Dynamically generated simulation feasible ?
Three revolutions reinforcing each other:
Deep rivers of development, ripe for re-purposing
Example : DL denoising for faster ray trace convergence
Re-evaluate long held practices in light of new realities:
Highlights 2019
Opticks : state-of-the-art GPU ray tracing applied to optical photon simulation and integrated with Geant4, giving a leap in performance that eliminates memory and time bottlenecks.
- Drastic speedup -> better detector understanding -> greater precision
- any simulation limited by optical photons can benefit
- more photon limited -> more overall speedup (99% of time in photons -> up to 100x overall)
https://bitbucket.org/simoncblyth/opticks | code repository |
https://simoncblyth.bitbucket.io | presentations and videos |
https://groups.io/g/opticks | forum/mailing list archive |
email:opticks+subscribe@groups.io | subscribe to mailing list |
S(n) Expected Speedup
optical photon simulation, P ~ 99% of CPU time
Must consider processing "big picture"
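The expected speedup follows Amdahl's law; a minimal sketch using the P ~ 99% figure above:

    # Amdahl's law : accelerating a fraction P of the work by a factor n
    # gives overall speedup S(n) = 1 / ((1 - P) + P / n)
    def S(n, P=0.99):
        return 1.0 / ((1.0 - P) + P / n)

    for n in (10, 100, 1000, 1e9):
        print(f"n={n:>10g}  S(n)={S(n):6.1f}")   # saturates at 1/(1-P) = 100x for P = 0.99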
OpenGL Rasterization Pipeline
GPUs evolved to rasterize 3D graphics at 30/60 fps
Simple Array Data Structures (N-million,4)
Constant "Uniform" 4x4 matrices : scaling+rotation+translation
Graphical Experience Informs Fast Computation on GPUs
Waiting for memory reads/writes is a major source of latency...
Optical Photon Simulation
~perfect match for GPU acceleration
How Many Threads to Launch ?
Understanding Throughput-oriented Architectures https://cacm.acm.org/magazines/2010/11/100622-understanding-throughput-oriented-architectures/fulltext
NVIDIA Titan V: 80 SM, 5120 CUDA cores
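A rough launch-sizing sketch; 2048 resident threads per SM is the Volta hardware limit, the conclusion is the usual rule of thumb:

    # GPUs hide memory latency by oversubscribing cores with threads.
    sm_count     = 80                       # Titan V streaming multiprocessors
    cuda_cores   = 5120
    max_resident = sm_count * 2048          # 2048 resident threads per SM (Volta limit)
    print(max_resident)                     # 163,840 threads can be in flight at once
    print(max_resident / cuda_cores)        # 32x more threads than cores -> latency hiding
    # => launches of millions of photons keep the GPU busy; small launches cannot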
Array Serialization Benefits
Persist everything to file -> fast development cycle
Can transport everything across network:
Arrays for Everything -> direct access debug
Object-oriented : mixes data and compute
Array-oriented : separate data from compute
NumPy : standard array handling package
https://realpython.com/numpy-array-programming/
Look for "long" CPU loops
Q1 : What Shape is Your Data ?
longest "parallelization" axis first, e.g.:
longer -> more abundantly parallel -> more speedup
cross reference array elements with indices (not pointers)
Q2 : What are Independent Chunks ?
CUDA Thread for each element:
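For example, a CuPy elementwise kernel maps one CUDA thread to each array element (a minimal sketch, not taken from Opticks):

    import cupy as cp

    # one CUDA thread per element : clamp wavelengths to a domain, entirely on the GPU
    clamp = cp.ElementwiseKernel(
        'float32 w, float32 lo, float32 hi',        # inputs
        'float32 out',                              # output
        'out = w < lo ? lo : (w > hi ? hi : w)',    # per-element C code body
        'clamp')

    w = cp.random.uniform(300.0, 900.0, 10_000_000, dtype=cp.float32)
    out = clamp(w, cp.float32(60.0), cp.float32(820.0))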
CuPy : Simplest CUDA Interface
"Production" CuPy ? Depends on requirements:
C++ Based Interfaces to CUDA
Mature NVIDIA Basis Libraries
RAPIDS : New NVIDIA "Suite" of open source data science libs
https://bitbucket.org/simoncblyth/intro_to_numpy
Very terse, array-oriented (no-loop) Python interface with C performance
https://docs.scipy.org/doc/numpy/user/quickstart.html
http://www.scipy-lectures.org/intro/index.html
https://github.com/donnemartin/data-science-ipython-notebooks
import numpy as np, cupy as cp

def ellipse_closest_approach_to_point( xp, ex, ez, _c, N ):
    """ex, ez: ellipse semi-axes, c: coordinates of point in ellipse frame"""
    c = xp.asarray( _c ) ; assert c.shape == (2,)
    t = xp.linspace( 0, 2*np.pi, N )              # t: array of N angles [0,2pi]
    e = xp.zeros( [len(t), 2] )
    e[:,0] = ex*xp.cos(t)
    e[:,1] = ez*xp.sin(t)                         # e: N parametric [x,z] points on the ellipse
    return e[xp.sum((e-c)**2, axis=1).argmin()]   # point on ellipse closest to c
expression | shape | note |
---|---|---|
e | (10000000,2) | |
c | (2,) | |
e-c | (10000000,2) | c is broadcast over e : must be compatible shape |
np.sum((e-c)**2, 1) | (10000000,) | axis=1 : summing over the axis of length 2 |
np.sum((e-c)**2, 0) | (2,) | axis=0 : summing over axis of length 10M |
np.sum((e-c)**2, None) | () | axis=None : summing over all elements, yielding scalar |
17 import numpy as np, cupy as cp
...
67 if __name__ == '__main__':
68
69 N = 10000000 # 10M is large enough to make overheads negligible
70 M = 8 # repeat timings
71
72 ex,ey = 10,20 # parameters of ellipse
73 c = [100,100] # point to check
74
75 r = {} # python dict for results
76 t = np.zeros(M) # numpy array for timings
77
78 for i in range(M):
79 if i == 0:
80 r[i],t[i] = timed(pure_python_ellipse_closest_approach_to_point_DO_NOT_DO_THIS)(ex,ey,c,N)
81 else:
82 xp = np if i < M/2 else cp # switch between NumPy and CuPy implementations of same API
83 r[i],t[i] = timed(ellipse_closest_approach_to_point)( xp, ex,ey,c, N )
84 pass
85 pass
86 for k in r.keys():
87 if k == M/2: print("")
88 print(k, r[k], t[k])
89 pass
[blyth@localhost python_vs_numpy_vs_cupy]$ ipython -i ellipse_closest_approach_to_point.py   # "ipython -i" : run python script leaving context for interactive examination
Python 2.7.15 |Anaconda, Inc.| (default, May 1 2018, 23:32:55)
IPython 5.7.0 -- An enhanced Interactive Python.
...
(0, (4.983054096206095, 17.34003135802051), '6.266726970672607')
(1, array([ 4.9830541 , 17.34003136]), '0.674299955368042')
(2, array([ 4.9830541 , 17.34003136]), '0.6548769474029541')
(3, array([ 4.9830541 , 17.34003136]), '0.6450159549713135')

(4, array([ 4.9830541 , 17.34003136]), '0.4854261875152588')     # 1st CuPy run CUDA compiles populating cache
(5, array([ 4.9830541 , 17.34003136]), '0.000415802001953125')
(6, array([ 4.9830541 , 17.34003136]), '0.0003409385681152344')
(7, array([ 4.9830541 , 17.34003136]), '0.000762939453125')

In [1]: tpy = t[0]

In [2]: tnp = np.average(t[1:M/2])

In [3]: tcp = np.average(t[M/2+1:])

In [4]: tpy/tnp
Out[4]: 9.522970786915796            # NumPy ~ 10x Python

In [5]: tnp/tcp
Out[5]: 1299.0845622842799           # CuPy > 1000x NumPy : with almost zero effort can apply the CUDA stack
Load NumPy array into C/C++
Straightforward to parse NPY files
http://github.com/simoncblyth/np/
NP::Load
// gcc NPMinimal.cc -lstdc++ && ./a.out /tmp/a.npy
#include "NP.hh"

int main(int argc, char** argv)
{
    assert( argc > 1 && argv[1] ) ;
    NP* a = NP::Load(argv[1]) ;
    a->dump();
    return 0 ;
}
IPython persisting NumPy arrays:
In [1]: a = np.arange(10)          # array of 10 long (int64)

In [2]: a
Out[2]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [3]: np.save("/tmp/a.npy", a )  # serialize array to file

In [4]: a2 = np.load("/tmp/a.npy") # load array into memory

In [5]: assert np.all( a == a2 )   # check all elements the same
IPython examine NPY file format:
In [6]: !xxd /tmp/a.npy # xxd hexdump file contents
# minimal header : type and shape
00000000: 934e 554d 5059 0100 7600 7b27 6465 7363 .NUMPY..v.{'desc
00000010: 7227 3a20 273c 6938 272c 2027 666f 7274 r': '<i8', 'fort
00000020: 7261 6e5f 6f72 6465 7227 3a20 4661 6c73 ran_order': Fals
00000030: 652c 2027 7368 6170 6527 3a20 2831 302c e, 'shape': (10,
00000040: 292c 207d 2020 2020 2020 2020 2020 2020 ), }
00000050: 2020 2020 2020 2020 2020 2020 2020 2020
00000060: 2020 2020 2020 2020 2020 2020 2020 2020
00000070: 2020 2020 2020 2020 2020 2020 2020 200a .
00000080: 0000 0000 0000 0000 0100 0000 0000 0000 ................
00000090: 0200 0000 0000 0000 0300 0000 0000 0000 ................
..
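The header is simple enough to parse by hand; a Python 3 sketch for NPY format version 1.0, reading the example file above:

    # Minimal NPY v1.0 header parse : magic, version, header length, header dict.
    import ast

    with open("/tmp/a.npy", "rb") as f:
        assert f.read(6) == b"\x93NUMPY"              # magic
        major, minor = f.read(1)[0], f.read(1)[0]     # format version, 1.0 here
        hlen = int.from_bytes(f.read(2), "little")    # header length (uint16 in v1.0)
        hdr = ast.literal_eval(f.read(hlen).decode("latin1").strip())
        print(hdr)   # {'descr': '<i8', 'fortran_order': False, 'shape': (10,)}
    # the header is space-padded so the data that follows starts at an aligned offset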
Create 3D NumPy Array from C/C++
// gcc NPMiniMake.cc -lstdc++ && ./a.out 3 2 4
#include "NP.hh"

int main(int argc, char** argv)
{
    int ni = argc > 1 ? atoi(argv[1]) : 10 ;
    int nj = argc > 2 ? atoi(argv[2]) : 1 ;
    int nk = argc > 3 ? atoi(argv[3]) : 4 ;

    NP* a = new NP(ni,nj,nk) ;
    for(int i=0 ; i < ni ; i++ ){
    for(int j=0 ; j < nj ; j++ ){
    for(int k=0 ; k < nk ; k++ ){
        int index = i*nj*nk + j*nk + k ;   // 3d -> 1d
        a->data[index] = float(index) ;    // dummy value
    }
    }
    }
    a->save("/tmp/a.npy") ;
    return 0 ;
}
Check C++ created NumPy array from commandline:
$ python -c "import numpy as np ; print(np.load('/tmp/a.npy'))"
[[[ 0.  1.  2.  3.]
  [ 4.  5.  6.  7.]]

 [[ 8.  9. 10. 11.]
  [12. 13. 14. 15.]]

 [[16. 17. 18. 19.]
  [20. 21. 22. 23.]]]
Access/Create Array Data Anywhere
google:"parse NumPy file with Java/Rust/R/Swift/Android/..."
Store int/uint bits into float array
#include <iostream>
#include <cassert>
#include <cstdlib>
#include "uif.h"

int main(int argc, char** argv)
{
    int value = argc > 1 ? atoi(argv[1]) : -10 ;

    uif_t uif ;
    uif.i = value ;

    float a[3] = { 1.f, 1.f, 1.f } ;
    a[1] = uif.f ;                 // plant int bits into float array

    uif_t uif2 ;
    uif2.f = a[1] ;                // float copy
    assert( uif2.i == value ) ;    // recover int

    // std::cout << ... elided for brevity
    return 0 ;
}
$ gcc uifDemo.cc -lstdc++ && ./a.out -10
uif.u  4294967286 uif.i  -10 uif.f  nan
uif2.u 4294967286 uif2.i -10 uif2.f nan
http://bitbucket.org/simoncblyth/intro_to_numpy/src/tip/union_trick/
// uif.h
#pragma once

union uif_t {   // store three 32-bit types at the same location
    unsigned u ;
    int      i ;
    float    f ;
} ;
In [1]: import numpy as np

In [2]: a = np.ones(3, dtype=np.float32)

In [3]: a.view(np.int32)[1] = -10    # int32 inside float32 array : NumPy view reinterprets same bits as different type

In [4]: a
Out[4]: array([  1.,  nan,   1.], dtype=float32)

In [5]: a.view(np.int32)
Out[5]: array([1065353216,        -10, 1065353216], dtype=int32)
CUDA provides equivalent reinterpret device functions, __float_as_int and __int_as_float
OpenGL Geometry Shaders
-> need "popularity" index for each photon
Photon History Sequence (64-bit int)
Photon record renderer, OpenGL geometry shader extracts:
01 #version 410 core
..
37 uniform ivec4 RecSelect ;   // constant input to the "launch"
..
44 in ivec4 sel[];             // input array
..
46
47 layout (lines) in;
48 layout (points, max_vertices = 1) out;
..
52 void main ()
53 {
54     uint seqhis = sel[0].x ;
55     uint seqmat = sel[0].y ;
56     if( RecSelect.x > 0 && RecSelect.x != seqhis ) return ;
57     if( RecSelect.y > 0 && RecSelect.y != seqmat ) return ;
..     // History and Material Selection
63     vec4 p0 = gl_in[0].gl_Position ;
64     vec4 p1 = gl_in[1].gl_Position ;
65     float tc = Param.w / TimeDomain.y ;
66     // for interpolation of photon position at uniform input time
..
https://bitbucket.org/simoncblyth/opticks/src/default/thrustrap/TSparse_.cu
 89 template <typename T>
 90 void TSparse<T>::count_unique()
 91 {
        /// preparation of src with strided range iterator elided for brevity
 97
 98     thrust::device_vector<T> data(src.begin(), src.end());  // GPU copy to avoid sorting original
 99
100     thrust::sort(data.begin(), data.end());                  // GPU sort of millions of integers
101
102     // inner_product of sorted data with its shifted-by-one self finds "edges" between values
103     m_num_unique = thrust::inner_product(
104                         data.begin(), data.end() - 1,         // first1, last1
105                         data.begin() + 1,                     // first2
106                         int(1),                               // output type init
107                         thrust::plus<int>(),                  // reduction operator
108                         thrust::not_equal_to<T>()             // pair-by-pair operator, returning 1 at edges
109                    );
116     m_values.resize(m_num_unique);
117     m_counts.resize(m_num_unique);
118
119     // find all unique key values with their counts
120     thrust::reduce_by_key(
121                         data.begin(),                         // keys_first
122                         data.end(),                           // keys_last
123                         thrust::constant_iterator<int>(1),    // values_first
124                         m_values.begin(),                     // keys_output
125                         m_counts.begin()                      // values_output
126                    );
        /// sort into descending key order elided for brevity
147 }
Using Thrust will improve your C++ skills
Flavor of Thrust "Primitives"
adjacent_difference copy_if exclusive_scan exclusive_scan_by_key for_each gather generate inclusive_scan inner_product max_element merge_by_key min_element minmax_element reduce reduce_by_key scatter sort tabulate transform transform_inclusive_scan transform_reduce unique_by_key
Benefit from highly optimized GPU implementations
Thrust Hides GPU Details
Thrust : High level C++ interface to CUDA
An example of Thrust documentation
Best way to learn Thrust API is exercise it by writing small tests