http://devblogs.nvidia.com/parallelforall/cuda-pro-tip-nvprof-your-handy-universal-gpu-profiler/
nvprof knows how to profile CUDA kernels running on NVIDIA GPUs, no matter what language they are written in (as long as they are launched using the CUDA runtime API or driver API).
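A minimal sketch of pointing nvprof at the headless render script (driving it through subprocess is only for illustration; running "nvprof python simplecamera.py" directly in the shell is equivalent):

    # Sketch: profile the PyCUDA-based render with nvprof, which sees the
    # kernel launches and memcpys regardless of the host language.
    import subprocess

    subprocess.check_call(["nvprof", "python", "simplecamera.py"])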
http://devblogs.nvidia.com/parallelforall/pro-tip-clean-up-after-yourself-ensure-correct-profiling/
Therefore, you should clean up your application’s CUDA objects properly to make sure that the profiler is able to store all gathered data. This means not only freeing memory allocated on the GPU, but also resetting the device context.
If your application uses the CUDA Driver API, call cuProfilerStop() on each context to flush the profiling buffers before destroying the context with cuCtxDestroy().
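In PyCUDA terms (chroma drives the GPU through PyCUDA) that means creating and tearing down the context explicitly rather than leaving it to interpreter shutdown; a minimal sketch, with the workload elided:

    # Sketch: explicit context lifetime so the profiler can flush its buffers.
    import pycuda.driver as cuda

    cuda.init()
    ctx = cuda.Device(0).make_context()
    try:
        pass  # ... allocate GPU memory, launch kernels, copy results back ...
    finally:
        ctx.pop()     # make the context non-current
        ctx.detach()  # drop the last reference: context destroyed, buffers flushed
        # (pycuda.driver.stop_profiler(), where available, wraps cuProfilerStop())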
Also, how many registers is your kernel using? (Pass the --ptxas-options=-v
argument to nvcc to find out.) If you can only launch 16 threads per block,
the GPU will be idle most of the time.
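With PyCUDA the same ptxas report can be requested at JIT-compile time; a sketch with a placeholder kernel (not the chroma render kernel):

    # Sketch: pass --ptxas-options=-v through SourceModule so ptxas emits its
    # register / shared / constant memory usage report during the JIT compile
    # (PyCUDA surfaces the compiler output as a warning).
    import pycuda.autoinit  # creates a context
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    __global__ void dummy(float *a) { a[threadIdx.x] *= 2.0f; }
    """, options=["--ptxas-options=-v"])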
From a headless simplecamera.py render run:
1285 method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 2.020 ]
1286 method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 11.104 ]
1287 method=[ memcpyHtoD ] gputime=[ 1.280 ] cputime=[ 1.972 ]
1288 method=[ memcpyHtoD ] gputime=[ 1.280 ] cputime=[ 2.006 ]
1289 method=[ memcpyHtoD ] gputime=[ 1.280 ] cputime=[ 2.006 ]
1290 method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 1.996 ]
1291 method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 2.012 ]
1292 method=[ memcpyHtoD ] gputime=[ 1.280 ] cputime=[ 2.022 ]
1293 method=[ memcpyHtoD ] gputime=[ 1.280 ] cputime=[ 10.942 ]
1294 method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 4.039 ]
1295 method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 2.034 ]
1296 method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 1.891 ]
1297 method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 1.912 ]
1298 method=[ memcpyHtoD ] gputime=[ 1794.976 ] cputime=[ 1993.471 ]
1299 method=[ memcpyHtoD ] gputime=[ 1617.952 ] cputime=[ 1481.204 ]
1300 method=[ memcpyHtoD ] gputime=[ 1601.280 ] cputime=[ 1472.250 ]
1301 method=[ memcpyHtoD ] gputime=[ 7432.672 ] cputime=[ 7370.140 ]
1302 method=[ memcpyHtoD ] gputime=[ 4602.432 ] cputime=[ 4620.065 ]
1303 method=[ memcpyHtoD ] gputime=[ 2335.680 ] cputime=[ 2351.582 ]
1304 method=[ memcpyHtoD ] gputime=[ 1.664 ] cputime=[ 5.372 ]
1305 method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 2.315 ]
1306 method=[ memcpyHtoD ] gputime=[ 1.280 ] cputime=[ 2.037 ]
1307 method=[ memcpyHtoD ] gputime=[ 1.280 ] cputime=[ 1.973 ]
1308 method=[ memcpyHtoD ] gputime=[ 1.280 ] cputime=[ 2.185 ]
1309 method=[ memcpyHtoD ] gputime=[ 1.280 ] cputime=[ 2.113 ]
1310 method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 2.008 ]
1311 method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 2.010 ]
1312 method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 2.372 ]
1313 method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 2.009 ]
1314 method=[ memcpyHtoD ] gputime=[ 1.280 ] cputime=[ 1.959 ]
1315 method=[ memcpyHtoD ] gputime=[ 612.832 ] cputime=[ 501.086 ]
1316 method=[ memcpyHtoD ] gputime=[ 590.560 ] cputime=[ 449.675 ]
1317 method=[ fill ] gputime=[ 24.544 ] cputime=[ 13.470 ] occupancy=[ 1.000 ]
1318 method=[ fill ] gputime=[ 25.504 ] cputime=[ 7.263 ] occupancy=[ 1.000 ]
1319 method=[ render ] gputime=[ 5259416.500 ] cputime=[ 234.175 ] occupancy=[ 0.500 ]
1320 method=[ memcpyDtoH ] gputime=[ 194.016 ] cputime=[ 5260492.000 ]
(chroma_env)delta:chroma_camera blyth$ ./cuda_profile_parse.py cuda_profile_0.log
WARNING:__main__:failed to parse : # CUDA_PROFILE_LOG_VERSION 2.0
WARNING:__main__:failed to parse : # CUDA_DEVICE 0 GeForce GT 750M
WARNING:__main__:failed to parse : # CUDA_CONTEXT 1
WARNING:__main__:failed to parse : method,gputime,cputime,occupancy
memcpyDtoH : {'gputime': 201.504, 'cputime': 5260556.83}
write_size : {'gputime': 6.208, 'cputime': 37.704, 'occupancy': 0.048}
fill : {'gputime': 50.048, 'cputime': 20.733, 'occupancy': 2.0}
render : {'gputime': 5259416.5, 'cputime': 234.175, 'occupancy': 0.5}
memcpyHtoD : {'gputime': 22289.11999999997, 'cputime': 23602.95499999999}
(chroma_env)delta:chroma_camera blyth$
(chroma_env)delta:chroma_camera blyth$ tail -5 cuda_profile_0.log
method=[ memcpyHtoD ] gputime=[ 590.560 ] cputime=[ 449.675 ]
method=[ fill ] gputime=[ 24.544 ] cputime=[ 13.470 ] occupancy=[ 1.000 ]
method=[ fill ] gputime=[ 25.504 ] cputime=[ 7.263 ] occupancy=[ 1.000 ]
method=[ render ] gputime=[ 5259416.500 ] cputime=[ 234.175 ] occupancy=[ 0.500 ]
method=[ memcpyDtoH ] gputime=[ 194.016 ] cputime=[ 5260492.000 ]
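cuda_profile_parse.py itself is not reproduced here; the following is only a sketch of the same idea, summing the gputime/cputime (and occupancy) fields per method from the key=[ value ] lines, which yields totals of the form shown above:

    #!/usr/bin/env python
    # Sketch of a cuda_profile_0.log summarizer: sum gputime/cputime (and
    # occupancy) per method. Not the actual cuda_profile_parse.py, just the idea.
    from __future__ import print_function
    import re
    import sys
    from collections import defaultdict

    PAIR = re.compile(r"(\w+)=\[\s*([^\]]+?)\s*\]")

    def summarize(path):
        totals = defaultdict(lambda: defaultdict(float))
        with open(path) as fp:
            for line in fp:
                fields = dict(PAIR.findall(line))
                method = fields.pop("method", None)
                if method is None:
                    continue  # header/comment lines fail to parse, as in the WARNINGs above
                for key, value in fields.items():
                    totals[method][key] += float(value)
        return totals

    if __name__ == "__main__":
        for method, tot in summarize(sys.argv[1]).items():
            print(method, ":", dict(tot))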
Column meanings (from the command line profiler documentation):

method
    A character string giving the name of the GPU kernel or memory copy method. For kernels the method name is the mangled name generated by the compiler.

occupancy
    The multiprocessor occupancy, i.e. the ratio of the number of active warps to the maximum number of warps supported on a multiprocessor of the GPU. This is helpful in determining how effectively the GPU is kept busy. The column is output only for GPU kernels and the value is a single precision floating point number in the range 0.0 to 1.0.

cputime
    For non-blocking methods the cputime is only the CPU (host side) overhead to launch the method, so
        walltime = cputime + gputime
    For blocking methods the cputime is the sum of the gputime and the CPU overhead, so
        walltime = cputime
    Note that all kernel launches are non-blocking by default, but become blocking if any of the profiler counters are enabled. Asynchronous memory copy requests in different streams are also non-blocking. The column value is a single precision floating point number in microseconds.

gputime
    The execution time of the GPU kernel or memory copy method, calculated as (gpuendtimestamp - gpustarttimestamp)/1000.0. The column value is a single precision floating point number in microseconds.
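Reading the log above with those definitions: the render launch is non-blocking, so its cputime (~234 us) is just launch overhead while its gputime is ~5,259,417 us (about 5.26 s of kernel execution); the memcpyDtoH that follows is blocking, so its cputime of ~5,260,492 us is almost entirely the wait for render to finish plus the ~194 us copy itself. The render occupancy of 0.500 means that, per the definition above, half of the warps a multiprocessor can hold were active (32 of the 64 supported on this compute capability 3.0 device).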
The command line profiler is controlled using the following environment variables:
COMPUTE_PROFILE: is set to either 1 or 0 (or unset) to enable or disable profiling.
COMPUTE_PROFILE_LOG: is set to the desired file path for profiling output.
    With multiple contexts you must add '%d' to the COMPUTE_PROFILE_LOG name; a separate profiler output file is generated per context, with '%d' substituted by the context number (contexts are numbered starting with zero).
    With multiple processes you must add '%p' to the COMPUTE_PROFILE_LOG name; a separate profiler output file is generated per process, with '%p' substituted by the process id.
    If no log path is specified, the profiler logs data to "cuda_profile_%d.log" in the case of a CUDA context ('%d' substituted by the context number).
COMPUTE_PROFILE_CSV: is set to either 1 (set) or 0 (unset) to enable or disable a comma separated version of the log output.
COMPUTE_PROFILE_CONFIG: is used to specify a config file for selecting profiling options and performance counters.
Configuration details are covered in a subsequent section.
The following old environment variables used for the above functionalities are still supported:
CUDA_PROFILE
CUDA_PROFILE_LOG
CUDA_PROFILE_CSV
CUDA_PROFILE_CONFIG
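These can be exported in the shell before launching, or set from Python as long as it happens before the CUDA context is created; a sketch (values are just examples):

    # Sketch: enable the command line profiler from inside the script.
    # Must execute before pycuda.autoinit (or any other context creation).
    import os
    os.environ["COMPUTE_PROFILE"] = "1"
    os.environ["COMPUTE_PROFILE_LOG"] = "cuda_profile_%d.log"  # %d -> context number
    os.environ["COMPUTE_PROFILE_CSV"] = "0"

    import pycuda.autoinit  # context created here, with profiling active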
(chroma_env)delta:e blyth$ nvprof --query-metrics
Available Metrics:
Name Description
Device 0 (GeForce GT 750M):
l1_cache_global_hit_rate: Hit rate in L1 cache for global loads
branch_efficiency: Ratio of non-divergent branches to total branches
l1_cache_local_hit_rate: Hit rate in L1 cache for local loads and stores
sm_efficiency: The percentage of time at least one warp is active on a multiprocessor
ipc: Instructions executed per cycle
achieved_occupancy: Ratio of the average active warps per active cycle to the maximum number of warps supported on a multiprocessor
gld_requested_throughput: Requested global memory load throughput
gst_requested_throughput: Requested global memory store throughput
sm_efficiency_instance: The percentage of time at least one warp is active on a multiprocessor
ipc_instance: Instructions executed per cycle
inst_replay_overhead: Average number of replays for each instruction executed
shared_replay_overhead: Average number of replays due to shared memory conflicts for each instruction executed
global_replay_overhead: Average number of replays due to local memory cache misses for each instruction executed
global_cache_replay_overhead: Average number of replays due to global memory cache misses for each instruction executed
tex_cache_hit_rate: Texture cache hit rate
tex_cache_throughput: Texture cache throughput
dram_read_throughput: Device memory read throughput
dram_write_throughput: Device memory write throughput
gst_throughput: Global memory store throughput
gld_throughput: Global memory load throughput
local_replay_overhead: Average number of replays due to local memory accesses for each instruction executed
shared_efficiency: Ratio of requested shared memory throughput to required shared memory throughput
gld_efficiency: Ratio of requested global memory load throughput to required global memory load throughput
gst_efficiency: Ratio of requested global memory store throughput to required global memory store throughput
l2_l1_read_hit_rate: Hit rate at L2 cache for all read requests from L1 cache
l2_texture_read_hit_rate: Hit rate at L2 cache for all read requests from texture cache
l2_l1_read_throughput: Memory read throughput seen at L2 cache for read requests from L1 cache
l2_texture_read_throughput: Memory read throughput seen at L2 cache for read requests from the texture cache
local_memory_overhead: Ratio of local memory traffic to total memory traffic between the L1 and L2 caches
issued_ipc: Instructions issued per cycle
inst_per_warp: Average number of instructions executed by each warp
issue_slot_utilization: Percentage of issue slots that issued at least one instruction, averaged across all cycles
local_load_transactions_per_request: Average number of local memory load transactions performed for each local memory load
local_store_transactions_per_request: Average number of local memory store transactions performed for each local memory store
shared_load_transactions_per_request: Average number of shared memory load transactions performed for each shared memory load
shared_store_transactions_per_request: Average number of shared memory store transactions performed for each shared memory store
gld_transactions_per_request: Average number of global memory load transactions performed for each global memory load
gst_transactions_per_request: Average number of global memory store transactions performed for each global memory store
local_load_transactions: Number of local memory load transactions
local_store_transactions: Number of local memory store transactions
shared_load_transactions: Number of shared memory load transactions
shared_store_transactions: Number of shared memory store transactions
gld_transactions: Number of global memory load transactions
gst_transactions: Number of global memory store transactions
sysmem_read_transactions: Number of system memory read transactions
sysmem_write_transactions: Number of system memory write transactions
tex_cache_transactions: Texture cache read transactions
dram_read_transactions: Device memory read transactions
dram_write_transactions: Device memory write transactions
l2_read_transactions: Memory read transactions seen at L2 cache for all read requests
l2_write_transactions: Memory write transactions seen at L2 cache for all write requests
local_load_throughput: Local memory load throughput
local_store_throughput: Local memory store throughput
shared_load_throughput: Shared memory load throughput
shared_store_throughput: Shared memory store throughput
l2_read_throughput: Memory read throughput seen at L2 cache for all read requests
l2_write_throughput: Memory write throughput seen at L2 cache for all write requests
sysmem_read_throughput: System memory read throughput
sysmem_write_throughput: System memory write throughput
cf_issued: Number of issued control-flow instructions
cf_executed: Number of executed control-flow instructions
ldst_issued: Number of issued load and store instructions
ldst_executed: Number of executed load and store instructions
flops_sp: Single-precision floating point operations executed
flops_sp_add: Single-precision floating point add operations executed
flops_sp_mul: Single-precision floating point multiply operations executed
flops_sp_fma: Single-precision floating point multiply accumulate operations executed
flops_dp: Double-precision floating point operations executed
flops_dp_add: Double-precision floating point add operations executed
flops_dp_mul: Double-precision floating point multiply operations executed
flops_dp_fma: Double-precision floating point multiply accumulate operations executed
flops_sp_special: Single-precision floating point special operations executed
l1_shared_utilization: The utilization level of the L1/shared memory relative to peak utilization
l2_utilization: The utilization level of the L2 cache relative to the peak utilization
tex_utilization: The utilization level of the texture cache relative to the peak utilization
dram_utilization: The utilization level of the device memory relative to the peak utilization
sysmem_utilization: The utilization level of the system memory relative to the peak utilization
ldst_fu_utilization: The utilization level of the multiprocessor function units that execute load and store instructions
alu_fu_utilization: The utilization level of the multiprocessor function units that execute integer and floating-point arithmetic instructions
cf_fu_utilization: The utilization level of the multiprocessor function units that execute control-flow instructions
tex_fu_utilization: The utilization level of the multiprocessor function units that execute texture instructions
inst_executed: The number of instructions executed
inst_issued: The number of instructions issued
issue_slots: The number of issue slots used
(chroma_env)delta:e blyth$ which nvprof
/Developer/NVIDIA/CUDA-5.5/bin/nvprof
(chroma_env)delta:e blyth$
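Any of the metrics listed above can be collected for the same render run; a sketch, again just wrapping the shell command (metric selection is illustrative):

    # Sketch: equivalent to
    #   nvprof --metrics achieved_occupancy,branch_efficiency,gld_efficiency python simplecamera.py
    import subprocess

    subprocess.check_call([
        "nvprof",
        "--metrics", "achieved_occupancy,branch_efficiency,gld_efficiency",
        "python", "simplecamera.py",
    ])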
(chroma_env)delta:e blyth$ nvprof --query-events
Available Events:
Name Description
Device 0 (GeForce GT 750M):
Domain domain_a:
tex0_cache_sector_queries: Number of texture cache 0 requests. This increments by 1 for each 32-byte access.
tex1_cache_sector_queries: Number of texture cache 1 requests. This increments by 1 for each 32-byte access.
tex2_cache_sector_queries: Number of texture cache 2 requests. This increments by 1 for each 32-byte access. Value will be 0 for devices that contain only 2 texture units.
tex3_cache_sector_queries: Number of texture cache 3 requests. This increments by 1 for each 32-byte access. Value will be 0 for devices that contain only 2 texture units.
tex0_cache_sector_misses: Number of texture cache 0 misses. This increments by 1 for each 32-byte access.
tex1_cache_sector_misses: Number of texture cache 1 misses. This increments by 1 for each 32-byte access.
tex2_cache_sector_misses: Number of texture cache 2 misses. This increments by 1 for each 32-byte access. Value will be 0 for devices that contain only 2 texture units.
tex3_cache_sector_misses: Number of texture cache 3 misses. This increments by 1 for each 32-byte access. Value will be 0 for devices that contain only 2 texture units.
elapsed_cycles_sm: Elapsed clocks
Domain domain_b:
fb_subp0_read_sectors: Number of DRAM read requests to sub partition 0, increments by 1 for 32 byte access.
fb_subp1_read_sectors: Number of DRAM read requests to sub partition 1, increments by 1 for 32 byte access.
fb_subp0_write_sectors: Number of DRAM write requests to sub partition 0, increments by 1 for 32 byte access.
fb_subp1_write_sectors: Number of DRAM write requests to sub partition 1, increments by 1 for 32 byte access.
l2_subp0_write_sector_misses: Number of write misses in slice 0 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp1_write_sector_misses: Number of write misses in slice 1 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp2_write_sector_misses: Number of write misses in slice 2 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp3_write_sector_misses: Number of write misses in slice 3 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp0_read_sector_misses: Number of read misses in slice 0 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp1_read_sector_misses: Number of read misses in slice 1 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp2_read_sector_misses: Number of read misses in slice 2 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp3_read_sector_misses: Number of read misses in slice 3 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp0_write_l1_sector_queries: Number of write requests from L1 to slice 0 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp1_write_l1_sector_queries: Number of write requests from L1 to slice 1 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp2_write_l1_sector_queries: Number of write requests from L1 to slice 2 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp3_write_l1_sector_queries: Number of write requests from L1 to slice 3 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp0_read_l1_sector_queries: Number of read requests from L1 to slice 0 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp1_read_l1_sector_queries: Number of read requests from L1 to slice 1 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp2_read_l1_sector_queries: Number of read requests from L1 to slice 2 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp3_read_l1_sector_queries: Number of read requests from L1 to slice 3 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp0_read_l1_hit_sectors: Number of read requests from L1 that hit in slice 0 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp1_read_l1_hit_sectors: Number of read requests from L1 that hit in slice 1 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp2_read_l1_hit_sectors: Number of read requests from L1 that hit in slice 2 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp3_read_l1_hit_sectors: Number of read requests from L1 that hit in slice 3 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp0_read_tex_sector_queries: Number of read requests from Texture cache to slice 0 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp1_read_tex_sector_queries: Number of read requests from Texture cache to slice 1 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp2_read_tex_sector_queries: Number of read requests from Texture cache to slice 2 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp3_read_tex_sector_queries: Number of read requests from Texture cache to slice 3 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp0_read_tex_hit_sectors: Number of read requests from Texture cache that hit in slice 0 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp1_read_tex_hit_sectors: Number of read requests from Texture cache that hit in slice 1 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp2_read_tex_hit_sectors: Number of read requests from Texture cache that hit in slice 2 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp3_read_tex_hit_sectors: Number of read requests from Texture cache that hit in slice 3 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp0_read_sysmem_sector_queries: Number of system memory read requests to slice 0 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp1_read_sysmem_sector_queries: Number of system memory read requests to slice 1 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp2_read_sysmem_sector_queries: Number of system memory read requests to slice 2 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp3_read_sysmem_sector_queries: Number of system memory read requests to slice 3 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp0_write_sysmem_sector_queries: Number of system memory write requests to slice 0 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp1_write_sysmem_sector_queries: Number of system memory write requests to slice 1 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp2_write_sysmem_sector_queries: Number of system memory write requests to slice 2 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp3_write_sysmem_sector_queries: Number of system memory write requests to slice 3 of L2 cache. This increments by 1 for each 32-byte access.
l2_subp0_total_read_sector_queries: Total read requests to slice 0 of L2 cache. This includes requests from L1, Texture cache, system memory. This increments by 1 for each 32-byte access.
l2_subp1_total_read_sector_queries: Total read requests to slice 1 of L2 cache. This includes requests from L1, Texture cache, system memory. This increments by 1 for each 32-byte access.
l2_subp2_total_read_sector_queries: Total read requests to slice 2 of L2 cache. This includes requests from L1, Texture cache, system memory. This increments by 1 for each 32-byte access.
l2_subp3_total_read_sector_queries: Total read requests to slice 3 of L2 cache. This includes requests from L1, Texture cache, system memory. This increments by 1 for each 32-byte access.
l2_subp0_total_write_sector_queries: Total write requests to slice 0 of L2 cache. This includes requests from L1, Texture cache, system memory. This increments by 1 for each 32-byte access.
l2_subp1_total_write_sector_queries: Total write requests to slice 1 of L2 cache. This includes requests from L1, Texture cache, system memory. This increments by 1 for each 32-byte access.
l2_subp2_total_write_sector_queries: Total write requests to slice 2 of L2 cache. This includes requests from L1, Texture cache, system memory. This increments by 1 for each 32-byte access.
l2_subp3_total_write_sector_queries: Total write requests to slice 3 of L2 cache. This includes requests from L1, Texture cache, system memory. This increments by 1 for each 32-byte access.
Domain domain_c:
gld_inst_8bit: Total number of 8-bit global load instructions that are executed by all the threads across all thread blocks.
gld_inst_16bit: Total number of 16-bit global load instructions that are executed by all the threads across all thread blocks.
gld_inst_32bit: Total number of 32-bit global load instructions that are executed by all the threads across all thread blocks.
gld_inst_64bit: Total number of 64-bit global load instructions that are executed by all the threads across all thread blocks.
gld_inst_128bit: Total number of 128-bit global load instructions that are executed by all the threads across all thread blocks.
gst_inst_8bit: Total number of 8-bit global store instructions that are executed by all the threads across all thread blocks.
gst_inst_16bit: Total number of 16-bit global store instructions that are executed by all the threads across all thread blocks.
gst_inst_32bit: Total number of 32-bit global store instructions that are executed by all the threads across all thread blocks.
gst_inst_64bit: Total number of 64-bit global store instructions that are executed by all the threads across all thread blocks.
gst_inst_128bit: Total number of 128-bit global store instructions that are executed by all the threads across all thread blocks.
Domain domain_d:
prof_trigger_00: User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.
prof_trigger_01: User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.
prof_trigger_02: User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.
prof_trigger_03: User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.
prof_trigger_04: User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.
prof_trigger_05: User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.
prof_trigger_06: User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.
prof_trigger_07: User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.
warps_launched: Number of warps launched on a multiprocessor.
threads_launched: Number of threads launched on a multiprocessor.
inst_issued1: Number of single instruction issued per cycle
inst_issued2: Number of dual instructions issued per cycle
inst_executed: Number of instructions executed, do not include replays.
shared_load: Number of executed load instructions where state space is specified as shared, increments per warp on a multiprocessor.
shared_store: Number of executed store instructions where state space is specified as shared, increments per warp on a multiprocessor.
local_load: Number of executed load instructions where state space is specified as local, increments per warp on a multiprocessor.
local_store: Number of executed store instructions where state space is specified as local, increments per warp on a multiprocessor.
gld_request: Number of executed load instructions where the state space is not specified and hence generic addressing is used, increments per warp on a multiprocessor. It can include the load operations from global,local and shared state space.
gst_request: Number of executed store instructions where the state space is not specified and hence generic addressing is used, increments per warp on a multiprocessor. It can include the store operations to global,local and shared state space.
atom_count: Number of warps executing atomic reduction operations. Increments by one if at least one thread in a warp executes the instruction.
gred_count: Number of warps executing reduction operations on global and shared memory. Increments by one if at least one thread in a warp executes the instruction
branch: Number of branch instructions executed per warp on a multiprocessor.
divergent_branch: Number of divergent branches within a warp. This counter will be incremented by one if at least one thread in a warp diverges (that is, follows a different execution path) via a conditional branch.
active_cycles: Number of cycles a multiprocessor has at least one active warp. This event can increment by 0 - 1 on each cycle.
active_warps: Accumulated number of active warps per cycle. For every cycle it increments by the number of active warps in the cycle which can be in the range 0 to 64.
sm_cta_launched: Number of thread blocks launched on a multiprocessor.
local_load_transactions: Number of local load transactions from L1 cache. Increments by 1 per transaction. Transaction can be 32/64/96/128B.
local_store_transactions: Number of local store transactions to L1 cache. Increments by 1 per transaction. Transaction can be 32/64/96/128B.
l1_shared_load_transactions: Number of shared load transactions. Increments by 1 per transaction. Transaction can be 32/64/96/128B.
l1_shared_store_transactions: Number of shared store transactions. Increments by 1 per transaction. Transaction can be 32/64/96/128B.
__l1_global_load_transactions: Number of global load transactions from L1 cache. Increments by 1 per transaction. Transaction can be 32/64/96/128B.
__l1_global_store_transactions: Number of global store transactions from L1 cache. Increments by 1 per transaction. Transaction can be 32/64/96/128B.
l1_local_load_hit: Number of cache lines that hit in L1 cache for local memory load accesses. In case of perfect coalescing this increments by 1,2, and 4 for 32, 64 and 128 bit accesses by a warp respectively.
l1_local_load_miss: Number of cache lines that miss in L1 cache for local memory load accesses. In case of perfect coalescing this increments by 1,2, and 4 for 32, 64 and 128 bit accesses by a warp respectively.
l1_local_store_hit: Number of cache lines that hit in L1 cache for local memory store accesses. In case of perfect coalescing this increments by 1,2, and 4 for 32, 64 and 128 bit accesses by a warp respectively.
l1_local_store_miss: Number of cache lines that miss in L1 cache for local memory store accesses. In case of perfect coalescing this increments by 1,2, and 4 for 32,64 and 128 bit accesses by a warp respectively.
l1_global_load_hit: Number of cache lines that hit in L1 cache for global memory load accesses. In case of perfect coalescing this increments by 1,2, and 4 for 32, 64 and 128 bit accesses by a warp respectively.
l1_global_load_miss: Number of cache lines that miss in L1 cache for global memory load accesses. In case of perfect coalescing this increments by 1,2, and 4 for 32, 64 and 128 bit accesses by a warp respectively.
uncached_global_load_transaction: Number of uncached global load transactions. Increments by 1 per transaction. Transaction can be 32/64/96/128B.
global_store_transaction: Number of global store transactions. Increments by 1 per transaction. Transaction can be 32/64/96/128B.
shared_load_replay: Replays caused due to shared load bank conflict (when the addresses for two or more shared memory load requests fall in the same memory bank) or when there is no conflict but the total number of words accessed by all threads in the warp executing that instruction exceed the number of words that can be loaded in one cycle (256 bytes).
shared_store_replay: Replays caused due to shared store bank conflict (when the addresses for two or more shared memory store requests fall in the same memory bank) or when there is no conflict but the total number of words accessed by all threads in the warp executing that instruction exceed the number of words that can be stored in one cycle.
global_ld_mem_divergence_replays: global ld is replayed due to divergence
global_st_mem_divergence_replays: global st is replayed due to divergence
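Several of the derived metrics come from these raw counters; for example branch_efficiency (the ratio of non-divergent branches to total branches) can be recomputed from branch and divergent_branch. A sketch with made-up counts:

    # Sketch: branch efficiency in percent from the raw event counters,
    # per the "ratio of non-divergent branches to total branches" definition.
    def branch_efficiency(branch, divergent_branch):
        if branch == 0:
            return 0.0
        return 100.0 * (branch - divergent_branch) / branch

    # Hypothetical counts, not taken from the render run above:
    print(branch_efficiency(branch=120000, divergent_branch=3000))  # -> 97.5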