NVIDIA Visual Profiler (nvprof)
usage
nvprof [options] [application] [application-arguments]
usage mode
- summary mode
nvprof matrixMul
- track gpu trace
nvprof --print-gpu-trace matrixMul
- track API trace
nvprof --print-api-trace matrixMul
Note: the API trace can be turned off, if not needed, with --profile-api-trace none. This reduces some of the profiling overhead, especially when the kernels are short.
Event/metric Summary Mode
nvprof --events warps_launched,local_load --metrics ipc matrixMul
- some useful events and metrics
events
- gld_inst_32bit
- gst_*
- global_load
- global_store
- local_*
- warps_launched
- active_cycles
- *_warps/ctas
- tex0_cache_sector_queries
- tex1_*
metrics
- ipc
- gld_transactions_per_request
- gst_*
- gld_efficiency
- sm_efficiency
- l2_read_transactions
- l2_tex_transactions
- l2_utilization
example
nvprof --events warps_launched,local_load --metrics ipc matrixMul
- Event/metric Trace Mode
nvprof --aggregate-mode off --events local_load --print-gpu-trace matrixMul
other important options
- --dependency-analysis
- Timeline
nvprof --export-profile timeline.prof <app> <app args>
- Metrics And Events
The second use case is to collect events or metrics for all kernels in an application for
which you have already collected a timeline. Collecting events or metrics for all kernels significantly
changes the overall performance characteristics of the application, because all kernel executions are serialized on the GPU.
Even though overall application performance is changed, the event and metric values
for individual kernels are still correct, so you can merge the collected values
onto the previously collected timeline to get an accurate picture of the application's behavior.
nvprof --metrics achieved_occupancy,executed_ipc -o metrics.prof <app> <app args>
- Analysis For Individual Kernel
nvprof --kernels <kernel specifier> --analysis-metrics -o analysis.prof <app> <app args>
metric reference
http://docs.nvidia.com/cuda/profiler-users-guide/index.html#metrics-reference
simple compile app
.cu -> .ptx -> .cubin -> exe
you can use "nvcc -keep" to preserve the intermediate files
"--ptxas-options=-v" gives verbose compilation output, reporting:
- number of registers used
- shared memory bytes
- local memory in bytes
cuobjdump
a disassembly tool, for static inspection of the compiled instructions
CUDA timing functions
e.g.
cudaEvent_t start, stop;
float elapsed;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
fool_kernel<<<grid, block>>>();
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsed, start, stop); // returns milliseconds; resolution is around 0.5 microseconds
printf("elapsed time %f (seconds)\n", elapsed / 1000);
cudaEventDestroy(start);
cudaEventDestroy(stop);
