Tuning multi-threaded applications

This article outlines optimization techniques in five key areas: thread synchronization, system-bus optimization, memory management, front-end optimization, and execution-resource scheduling. The core practices include reducing lock contention, exploiting the caches, balancing workloads, and using hardware resources intelligently, all aimed at improving the performance and responsiveness of multi-threaded software.

These optimization guidelines cover five specific areas (arranged in order of importance):
■Thread synchronization
■System-bus optimization
■Memory optimization
■Front-end optimization
■Execution-resource optimization

Key practices of thread synchronization
Key practices for minimizing the cost of thread synchronization are summarized below:
■Insert the pause instruction in fast spin loops and keep the number of loop repetitions to a minimum to improve overall system performance (see the spin-lock sketch after this list).
■Replace a spin lock that may be acquired by multiple threads with pipelined locks so that no more than two threads have write access to any one lock. If only one thread needs to write to a variable shared by two threads, there is no need to acquire a lock.
■Use a thread-blocking API in a long idle loop to free up the processor.
■Prevent false sharing of per-thread data between two threads.
■Place each synchronization variable alone, separated by a cache line (128 bytes for Intel Pentium 4 processors).
■Always regression-test an application with the number of threads limited to one, so that its performance can be assessed in situations where multiple threads are not available or do not work.
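
To make the pause and variable-placement bullets concrete, here is a minimal sketch, assuming x86 and the GCC/Clang `__sync` builtins; the names spin_lock_t, lock_acquire, and lock_release are illustrative, not part of the original guidance. The lock word is padded out to 128 bytes so it shares no cache sector with other data, and the spin loop issues the pause instruction between read-only polls.

```c
#include <xmmintrin.h>  /* _mm_pause */

/* Pad the lock to 128 bytes (the cache-sector size on Intel
 * Pentium 4 processors) so no other data shares its line,
 * preventing false sharing with per-thread data. */
typedef struct {
    volatile int locked;
    char pad[128 - sizeof(int)];
} __attribute__((aligned(128))) spin_lock_t;

static void lock_acquire(spin_lock_t *l)
{
    while (__sync_lock_test_and_set(&l->locked, 1)) {
        /* Spin read-only until the lock looks free; pause reduces
         * bus traffic and yields shared execution resources to the
         * sibling logical processor. */
        while (l->locked)
            _mm_pause();
    }
}

static void lock_release(spin_lock_t *l)
{
    __sync_lock_release(&l->locked);  /* store 0 with release semantics */
}
```
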
Key practices of system-bus optimization
Managing bus traffic can significantly impact the overall performance of multi-threaded software and MP systems. Key practices of system-bus optimization for achieving high data throughput and quick response are:
■Improve data and code locality to conserve bus-command bandwidth.
■Avoid excessive use of software prefetch instructions and allow the automatic hardware prefetcher to work; inappropriate software prefetching can significantly and unnecessarily increase bus utilization (see the sketch after this list).
■Consider overlapping multiple back-to-back memory reads to improve effective cache-miss latencies.
■Use full write transactions to achieve higher data throughput.
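
As a hedged illustration of the prefetch bullet above (the node layout and function are hypothetical): pointer chasing defeats the automatic hardware prefetcher, so one software prefetch per node can help, while the sequential payload reads are deliberately left to the hardware prefetcher rather than prefetched redundantly.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 */

struct node {
    struct node *next;
    float payload[16];
};

float sum_list(const struct node *n)
{
    float sum = 0.0f;
    while (n != NULL) {
        /* Fetch the next node while the current one is summed.
         * Prefetching every payload element as well would waste
         * bus-command bandwidth: the hardware prefetcher already
         * handles the sequential accesses. */
        if (n->next != NULL)
            _mm_prefetch((const char *)n->next, _MM_HINT_T0);
        for (size_t i = 0; i < 16; i++)
            sum += n->payload[i];
        n = n->next;
    }
    return sum;
}
```
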
Key practices of memory optimization
Key practices for optimizing memory operations are summarized below:
■Use cache blocking to improve locality of data access. Target one-quarter to one-half of the cache size when targeting Intel architecture 32-bit processors with Hyper-Threading Technology (see the tiled-transpose sketch after this list).
■Minimize the sharing of data between threads that execute on different physical processors sharing a common bus.
■Minimize data-access patterns that are offset by multiples of 64KB in each thread.
■When targeting Intel architecture 32-bit processors with Hyper-Threading Technology, adjust the private stack of each thread in an application so that the spacing between these stacks is not offset by multiples of 64KB or 1MB, preventing unnecessary cache-line evictions.
■When targeting Intel architecture 32-bit processors with Hyper-Threading Technology, add a per-instance stack offset when two instances of the same application are executing in lock step, to avoid memory accesses that are offset by multiples of 64KB or 1MB.
■Evenly balance workloads between processors, physical or logical. Load imbalance occurs when one or more processors sit idle waiting for other processors to finish. Load-imbalance issues can be as simple as one thread completing its allocated work before the others. Resolving imbalance issues typically requires splitting the work into smaller units that can be more evenly distributed across available resources.
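
As a sketch of the cache-blocking bullet above, here is a tiled matrix transpose; BLOCK is an illustrative placeholder that should be sized so the tile's working set fits in one-quarter to one-half of the cache being targeted.

```c
#include <stddef.h>

#define BLOCK 64  /* illustrative tile edge; size it so roughly
                     2 * BLOCK * BLOCK elements occupy one-quarter
                     to one-half of the target cache */

/* Tiled transpose: each BLOCK x BLOCK tile is read and written
 * while still cache-resident, unlike a naive row-by-row transpose
 * whose column accesses miss on every element for large n. */
void transpose_blocked(float *dst, const float *src, size_t n)
{
    for (size_t ii = 0; ii < n; ii += BLOCK)
        for (size_t jj = 0; jj < n; jj += BLOCK)
            for (size_t i = ii; i < ii + BLOCK && i < n; i++)
                for (size_t j = jj; j < jj + BLOCK && j < n; j++)
                    dst[j * n + i] = src[i * n + j];
}
```
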
Key practices of front-end optimization
Key practices for front-end optimization are:
■Avoid excessive loop unrolling to ensure the trace cache is operating efficiently (see the sketch after this list).
■Optimize code size to improve locality of trace cache and increase delivered trace length.
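
As a small sketch of the unrolling guidance (illustrative code, not a measured recommendation): a modest unroll amortizes loop overhead, while unrolling much further multiplies code size and can exceed the trace cache's ability to deliver long traces.

```c
/* Modest 4x unroll: amortizes per-iteration overhead without
 * bloating the decoded-code footprint; an aggressive 64x unroll
 * would enlarge the loop body and can evict other hot traces. */
void scale(float *a, const float *b, float s, int n)
{
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        a[i + 0] = s * b[i + 0];
        a[i + 1] = s * b[i + 1];
        a[i + 2] = s * b[i + 2];
        a[i + 3] = s * b[i + 3];
    }
    for (; i < n; i++)  /* remainder iterations */
        a[i] = s * b[i];
}
```
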
Key practices of execution-resource optimization
Each physical processor has dedicated execution resources, and the logical processors in each physical processor that supports Hyper-Threading Technology share on-chip execution resources. Key practices for execution-resource optimization include:
■Optimize each thread to achieve optimal frequency scaling first.
■Optimize multi-threaded applications to achieve optimal scaling with respect to the number of physical processors.
■To ensure compatibility with future processor implementations, do not hard-code cache sizes or cache-line sizes into an application; instead, always query the processor to determine the sizes of the shared cache resources (see the sketch after this list).
■Use on-chip execution resources cooperatively if two threads are sharing the execution resources in the same physical processor package.
■For each processor with Hyper-Threading Technology, consider adding functionally uncorrelated threads to increase the hardware-resource utilization of each physical processor package.
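
As a hedged sketch of the "query, do not hard-code" bullet (assuming a Linux/glibc target, where sysconf exposes the CPUID-derived cache geometry; the fallback values are assumptions, and a portable application would query CPUID directly or use the platform's equivalent API):

```c
#include <stdio.h>
#include <unistd.h>  /* sysconf */

int main(void)
{
    /* Query cache geometry at run time instead of hard-coding it. */
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    long l2   = sysconf(_SC_LEVEL2_CACHE_SIZE);

    if (line <= 0) line = 64;          /* assumed fallback */
    if (l2   <= 0) l2   = 256 * 1024;  /* assumed fallback */

    printf("cache line: %ld bytes, L2 cache: %ld bytes\n", line, l2);
    return 0;
}
```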
