http://www.brendangregg.com/methodology.html
- The USE Method: for finding resource bottlenecks
- The TSA Method: for analyzing application time
- Off-CPU Analysis: for analyzing any type of thread wait latency
- Active Benchmarking: for accurate and successful benchmarking

Summaries
I first summarized various methodologies for my USENIX LISA2012 talk "Performance Analysis Methodology" (PDF, slideshare, youtube, USENIX), then later documented them in my Systems Performance book. The following is my most up-to-date summary list, with methodologies enumerated.
Blame-Someone-Else Anti-Method
1. Find a system or environment component you are not responsible for
2. Hypothesize that the issue is with that component
3. Redirect the issue to the responsible team
4. When proven wrong, go to 1

Streetlight Anti-Method
1. Pick observability tools that are:
   - familiar
   - found on the Internet
   - found at random
2. Run tools
3. Look for obvious issues

Drunk Man Anti-Method
- Change things at random until the problem goes away

Random Change Anti-Method
1. Measure a performance baseline
2. Pick a random attribute to change (e.g., a tunable)
3. Change it in one direction
4. Measure performance
5. Change it in the other direction
6. Measure performance
7. Were the step 4 or 6 results better than the baseline? If so, keep the change; if not, revert
8. Go to step 1

Passive Benchmarking Anti-Method
1. Pick a benchmark tool
2. Run it with a variety of options
3. Make a slide deck of the results
4. Hand the slides to management

Ad Hoc Checklist Method
1..N. Run A; if B, do C

Problem Statement Method
1. What makes you think there is a performance problem?
2. Has this system ever performed well?
3. What has changed recently? (Software? Hardware? Load?)
4. Can the performance degradation be expressed in terms of latency or run time?
5. Does the problem affect other people or applications (or is it just you)?
6. What is the environment? What software and hardware is used? Versions? Configuration?

Scientific Method
1. Question
2. Hypothesis
3. Prediction
4. Test
5. Analysis

Workload Characterization Method
1. Who is causing the load? PID, UID, IP addr, ...
2. Why is the load called? code path
3. What is the load? IOPS, tput, type
4. How is the load changing over time?

Drill-Down Analysis Method
1. Start at highest level
2. Examine next-level details
3. Pick most interesting breakdown
4. If problem unsolved, go to 2

By-Layer Method
1. Dynamic languages
2. Executable
3. Libraries
4. Syscalls
5. Kernel: FS, network
6. Device drivers

Latency Analysis Method
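Latency analysis ends by quantifying how much faster an operation would be if its dominant component were fixed. The following is a minimal sketch of that arithmetic, not a definitive tool: the component names and millisecond values are hypothetical, and `estimate_speedup` is a helper name introduced here for illustration.

```python
def estimate_speedup(components, target, new_latency=0.0):
    """Estimate overall speedup if one synchronous component gets faster.

    components: dict mapping component name -> measured latency
    target: name of the component assumed to be fixed
    new_latency: assumed latency of that component after the fix
    """
    total = sum(components.values())
    new_total = total - components[target] + new_latency
    if new_total <= 0:
        # Nothing left to spend time on: the speedup is unbounded.
        return float("inf")
    return total / new_total


# Hypothetical breakdown of one request, in milliseconds.
components = {"parse": 2.0, "db_query": 45.0, "render": 3.0}

# The latency origin is the component with the largest share.
worst = max(components, key=components.get)

# Assume (hypothetically) that the fix brings that component down to 5 ms.
speedup = estimate_speedup(components, worst, new_latency=5.0)
print(f"{worst}: estimated speedup {speedup:.1f}x")
```

The estimate only holds for synchronous components, where the time for each one adds directly to the operation's total latency.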
1. Measure operation time (latency)
2. Divide into logical synchronous components
3. Continue division until latency origin is identified
4. Quantify: estimate speedup if problem fixed

Tools Method
1. List available performance tools (optionally add more)
2. For each tool, list its useful metrics
3. For each metric, list possible interpretation
4. Run selected tools and interpret selected metrics

USE Method
1. Utilization
2. Saturation
3. Errors

Stack Profile Method
1. Profile thread stack traces, on- and off-CPU
2. Coalesce
3. Study stacks bottom-up

Off-CPU Analysis
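Off-CPU analysis sums the blocked time of samples that share the same stack, then ranks the stacks by total time. Below is a minimal sketch of that coalescing step only. It assumes the samples were already collected by some profiler and are available as `(stack, blocked_ms)` pairs; the stack frames, durations, and the name `coalesce_off_cpu` are all hypothetical.

```python
from collections import defaultdict


def coalesce_off_cpu(samples):
    """Sum off-CPU time per unique stack, sorted from longest to shortest."""
    totals = defaultdict(float)
    for stack, blocked_ms in samples:
        # Tuples are hashable, so identical stacks fall into the same bucket.
        totals[tuple(stack)] += blocked_ms
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical off-CPU samples: (stack from root to leaf, blocked time in ms).
samples = [
    (("main", "read_file", "io_wait"), 10.0),
    (("main", "take_lock", "futex_wait"), 4.0),
    (("main", "read_file", "io_wait"), 5.0),
]

ranked = coalesce_off_cpu(samples)
for stack, total_ms in ranked:
    print(f"{total_ms:8.1f} ms  {';'.join(stack)}")
```

Joining the frames with semicolons is only a display choice for this sketch.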
1. Profile scheduler per-thread off-CPU time with stack traces
2. Coalesce times with like stacks
3. Study stacks from largest to shortest time

TSA Method
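The TSA Method ranks the operating system states a thread spends its time in, so that investigation starts with the most frequent one. The sketch below shows only the ranking step. It assumes the state of a thread has already been sampled at regular intervals into a list of state names; the sample data and the name `rank_thread_states` are hypothetical.

```python
from collections import Counter


def rank_thread_states(state_samples):
    """Return (state, fraction of samples) pairs, most frequent first."""
    counts = Counter(state_samples)
    total = sum(counts.values())
    # most_common() orders states from the highest count to the lowest.
    return [(state, n / total) for state, n in counts.most_common()]


# Hypothetical samples of one thread's state, taken at a fixed interval.
state_samples = ["Sleeping"] * 6 + ["Executing"] * 3 + ["Runnable"]

ranking = rank_thread_states(state_samples)
for state, fraction in ranking:
    print(f"{state:10s} {fraction:6.1%}")
```

With evenly spaced samples, the fraction of samples in a state approximates the fraction of time spent in it.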
1. For each thread of interest, measure time in operating system thread states, e.g.:
   - Executing
   - Runnable
   - Swapping
   - Sleeping
   - Lock
   - Idle
2. Investigate states from most to least frequent, using appropriate tools

Active Benchmarking Method
1. Configure the benchmark to run for a long duration
2. While running, analyze performance using other tools, and determine limiting factors
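The two steps above can be sketched in a single process: run the workload for a fixed duration while a second thread samples how much CPU time the process is consuming. This is a minimal illustration under stated assumptions, not a replacement for real observability tools: `active_benchmark` and its parameters are names introduced here, the workload is a placeholder, and CPU time versus wall-clock time is only one of many possible limiting factors to check.

```python
import threading
import time


def active_benchmark(workload, duration_s=1.0, interval_s=0.1):
    """Run workload repeatedly for duration_s while sampling CPU usage."""
    stop = threading.Event()
    samples = []

    def sampler():
        last_wall, last_cpu = time.perf_counter(), time.process_time()
        # Event.wait returns False on timeout, so this loops once per interval.
        while not stop.wait(interval_s):
            wall, cpu = time.perf_counter(), time.process_time()
            if wall > last_wall:
                samples.append((cpu - last_cpu) / (wall - last_wall))
            last_wall, last_cpu = wall, cpu

    thread = threading.Thread(target=sampler, daemon=True)
    start_wall, start_cpu = time.perf_counter(), time.process_time()
    thread.start()

    ops = 0
    deadline = start_wall + duration_s
    while time.perf_counter() < deadline:
        workload()
        ops += 1

    stop.set()
    thread.join()
    elapsed = time.perf_counter() - start_wall
    cpu_used = time.process_time() - start_cpu
    return {
        "ops": ops,
        "ops_per_s": ops / elapsed,
        # Process CPU time divided by wall time: values near 1.0 suggest the
        # run is CPU-bound on one core; values near 0 suggest it is waiting.
        "cpu_utilization": cpu_used / elapsed,
        "samples": samples,
    }


if __name__ == "__main__":
    # Placeholder workload that mostly sleeps, so CPU utilization stays low.
    result = active_benchmark(lambda: time.sleep(0.01), duration_s=0.5)
    print(result["ops"], round(result["cpu_utilization"], 2))
```

A real run would use a much longer duration and would examine the system with external tools while the benchmark is running.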