Book: http://book.douban.com/subject/3564286/
Authors: Calvin Lin (UT Austin), Lawrence Snyder (UW)
Chapter 1. Introduction
Instruction Level Parallelism (ILP): modern processor architecture, transparent to sequential programs (hidden parallelism)
However, prospects of ILP are becoming limited: existing techniques for exploiting ILP have largely reached the point of diminishing returns, in terms of both power consumption and performance.
--> multi-core computers, GPGPU; supercomputers, clusters, servers, grid computing
parallel computing vs. distributed computing:
1. Goal: parallel computing is to provide performance (processor power or memory) using multiple processors; distributed computing is to provide convenience (availability, reliability, physical distribution)
2. parallel computation values short execution time (interaction among processors is frequent, fine-grained with low overhead, assumed to be reliable); distributed computation values long uptime (interaction is infrequent, heavier weight, assumed to be unreliable)
concurrency and parallelism: (used in book interchangeably for logical concurrency)
concurrency is widely used in OS and DB communities to describe executions that are logically simultaneous;
parallelism is typically used by architecture and supercomputing communities to describe executions that physically execute simultaneously.
(Further reading: http://existentialtype.wordpress.com/2011/03/17/parallelism-is-not-concurrency/
http://existentialtype.wordpress.com/2014/04/09/parallelism-and-concurrency-revisited/ )
Thread: a thread has everything needed to execute a stream of instructions -- a private program text, a call stack, and a program counter -- but it shares access to memory with other threads. Thus, multiple threads can cooperate to compute on global data.
Race condition: a race condition exists when the result of an execution depends on the timing of two or more events --> mutex (mutual exclusion), atomicity (critical section) --> lock contention
Cache coherence protocol: to ensure that both processors see the same memory image, if processor 0 modifies a value at a given memory location, the hardware will invalidate any cached copy of that memory location that resides in processor 1's L1 cache, thereby preventing processor 1 from accidentally accessing a stale value of the data
False sharing: logically distinct data shares a physical cache line (the unit of cache coherence) --> pad private data so that each piece resides on a distinct cache line
Parallel algorithm: summation, prefix sum <== tournament algorithm: pair-wise summation (for prefix sum, up sweep + down sweep + passing values, logarithmic time complexity)
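(Not from the book: a minimal sequential C sketch of the tree-structured up-sweep/down-sweep prefix sum; each level's inner loop is independent work a parallel machine could do simultaneously, and the array length is assumed to be a power of two.)
#include <stdio.h>

/* Exclusive prefix sum via up-sweep (pairwise "tournament" sums) followed by
   a down-sweep that passes partial sums back down; O(log n) parallel steps. */
void prefix_sum(long a[], int n) {
    for (int d = 1; d < n; d *= 2)              /* up-sweep */
        for (int i = 0; i < n; i += 2 * d)
            a[i + 2*d - 1] += a[i + d - 1];
    a[n - 1] = 0;                               /* down-sweep */
    for (int d = n / 2; d >= 1; d /= 2)
        for (int i = 0; i < n; i += 2 * d) {
            long t = a[i + d - 1];
            a[i + d - 1] = a[i + 2*d - 1];
            a[i + 2*d - 1] += t;
        }
}

int main(void) {
    long a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    prefix_sum(a, 8);
    for (int i = 0; i < 8; ++i) printf("%ld ", a[i]);   /* prints 0 1 3 6 10 15 21 28 */
    printf("\n");
    return 0;
}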
Count 3s: limited bandwidth to memory prevents improvement from using more threads (on multi-core chips, bandwidth per core shrinks as the number of cores per chip increases)
(1) mutex --> control interaction among processors to deal with the race condition, (2) private counters --> granularity of parallelism to avoid excessive lock overhead, (3) padding --> understanding machine details to correct false sharing (see the padding sketch below), (4) even more processors bring no improvement --> limitation of L2-to-memory bandwidth
trade off a small amount of memory for increased parallelism and increased performance
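(Not from the book: a minimal C sketch of the padding fix in (3), assuming 64-byte cache lines; each thread's counter is forced onto its own line so updates do not falsely share a line.)
#define CACHE_LINE 64                            /* assumed coherence unit (cache line size) */
#define NUM_THREADS 8

struct padded_count {
    long count;                                  /* the thread-private counter */
    char pad[CACHE_LINE - sizeof(long)];         /* pad so adjacent counters sit on distinct lines */
};

struct padded_count counts[NUM_THREADS];         /* counts[t].count is updated only by thread t */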
Goals: correctness, good performance, scalability, portability
Chapter 2. Understanding Parallel Computers
Balancing machine specifics with portability --> six parallel computers in five categories: chip multiprocessors (Intel Core Duo, AMD Dual Core), symmetric multiprocessor architectures, heterogeneous chip designs, clusters, supercomputers
Chip Multiprocessors
cache coherency protocol: e.g. MESI (Modified, Exclusive, Shared and Invalid) --> overhead, more bandwidth requirement
Intel Core Duo has a shared L2 cache, managing coherence at the front of the L2 --> allows one processor to use more than its share of the L2 cache & lower-latency on-chip communication; preferred for a single two-processor chip;
In the AMD Dual Core, the L2 is private to each processor, and coherence is managed in the System Request Interface at the back of the L2 --> preferred for SMPs
Symmetric Multiprocessor Architectures
all processors access a single logical memory; processors are connected at a common point, the memory bus, where each processor can snoop on the memory reference activity --> adjusting the tags on their cached values to ensure coherent cache usage
the bus is a potential bottleneck (serial use of the bus limits the number of processors connected) --> ample L2 cache helps reduce congestion on the bus; multiple-core processors won't help because of increased memory requests to the bus
e.g. Sun Fire E25K: 18x18 crossbar interconnects for address, response, and data, plus 18 snoopy buses; shared memory of 1.15 TB
Heterogeneous Chip Designs
attached processors: Graphics Processing Units (GPUs), Field Programmable Gate Arrays (FPGAs), Cell processors
e.g. the Cell does not provide coherent memory for the synergistic processing elements (SPEs), choosing performance and hardware simplicity over programmability
Clusters
key property: memory is not shared among the commodity machines; processors communicate by passing messages
Supercomputers
e.g. BlueGene/L: processors have a slower clock rate; 65,536 dual-core nodes; nodes are arranged in a 3-dimensional torus network, each node connected to its six nearest neighbor nodes; communication between processors that are not directly connected in the torus is routed along a path through the network
torus network + collective network (for arithmetic capabilities, broadcasts) + barrier network (for synchronization)
Observation: shared memory machine vs. distributed memory machine; SISD (usual sequential computer), MISD (multiple redundant computations), SIMD (Cell's SPE), MIMD (most parallel machines today)
An abstraction of a sequential computer: the Random Access Machine (RAM) model, aka the von Neumann model --> any memory location can be referenced (read or written) in "unit" time without regard to its location
PRAM, parallel random access machine model: fails by misrepresenting memory behavior, because it's impossible to realize the unit-time single memory image for scalable machines (the delays required to keep memory consistent grow dramatically as the number of execution units increases)
--> PRAM does not work well as a model for programmers (PRAM completely ignores communication cost)
e.g. Valiant's algorithm for finding the maximum, though theoretically interesting and clever, is not faster than the tournament algorithm in practice. (http://courses.cs.washington.edu/courses/csep524/99wi/lectures/lecture10/sld009.htm)
Candidate Type Architecture (CTA)
CTA accounts for communication costs. It explicitly separates two types of memory references: inexpensive local references and expensive non-local references.
composed of: P standard sequential computers (processor + RAM + Network Interface Chip) connected by an interconnection network (communication network); and an (implicit) controller to assist with initialization, synchronization, eureka, etc.
topology of interconnection network: e.g. 2d torus, binary 3-cube, fat tree, omega network, etc.
node degree: the number of wires connecting a processor to the network (typically few --> only one or two network transfers in flight at once)
The model has no global memory: there are 3 widely used mechanisms to make non-local memory references: shared-memory, one-sided communication, and message passing --> from the CTA model perspective, they are interchangeable.
non-local memory latency: \lambda>>1, two to five orders of magnitude larger than local memory reference time (unit time)
(CTA ignores external I/O, whose cost is even harder to generalize)
Locality Rule: fast programs tend to maximize the number of local memory references and minimize the number of non-local memory references
(applying the locality rule --> redundancy but parallel)
shared memory: present a single coherent memory image to multiple threads --> easy to create race conditions; hard to debug
1-sided communication: support a single shared "address space" (all threads can reference all memory locations, but it does not keep the memory coherent); get() / put() --> simplify hardware by removing cache coherency protocol; require that programmers protect key variables with synchronization protocol
message passing: send() / recv(), a 2-sided mechanism --> least hardware support; the programmer must reason about distributed data structures and use two distinct mechanisms for moving data: memory references for local memory and message passing for non-local memory (see the MPI sketch below)
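(Not in the book's text here, but MPI is mentioned later in this chapter; a minimal C/MPI sketch of the 2-sided mechanism: rank 0 owns the value and explicitly ships it to rank 1, which must post a matching receive.)
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, value;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        value = 42;                                             /* local data owned by rank 0 */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);     /* explicit send ...          */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                            /* ... matched by a recv      */
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}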
Memory consistency model: because modern microprocessors employ latency hiding techniques that improve performance but affect semantics of parallel programs
relevant to parallel computers that implement a shared address space (shared memory or one-sided communication)
sequential consistency: result of any execution is the same as if (a) the operations of all processors were executed in some sequential order, and (b) the operations of each individual processor appear in the order specified by its program --> limiting performance by restricting latency-hiding such as buffering and pipelining
relaxed consistency model: (fundamentally, it is difficult to build memory that is both large and fast, so improvements in memory latency have not kept up with improvements in CPU speed) To reduce the latency of writes, modern microprocessors employ a "write buffer" residing on the same chip as the processor; a write is stored in the buffer without waiting for it to actually reach main memory; a subsequent read to a different address can then be serviced without waiting, which violates sequential consistency
Difficulty of relaxed consistency: e.g. while the mutex lock has been set but the write is still sitting in the write buffer, another processor sees the old value --> the critical section is not protected
--> modern microprocessors implement atomic operations that are guaranteed not to use the write buffer, e.g. test_and_swap
// lock == 0 means the critical section is free
do {
    old = test_and_swap(&lock, 1);   // atomically set lock = 1 and return its old value
} while (old != 0);                  // spin until we observed lock == 0
// ... critical section ...
test_and_swap(&lock, 0);             // exit critical section: clear lock so others can enter
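(Not from the book: the same spin-lock idea in standard C11 atomics; atomic_flag_test_and_set plays the role of test_and_swap, and with the default sequentially consistent ordering these operations are globally ordered, matching the guarantee described above.)
#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;      /* clear == critical section is free */

void acquire(void) {
    while (atomic_flag_test_and_set(&lock))      /* atomically set the flag, return old value */
        ;                                        /* spin until we observed it clear */
}

void release(void) {
    atomic_flag_clear(&lock);                    /* clear the flag so others can enter */
}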
parallel programming often breaks abstractions by preventing us from hiding certain low level details
Software interfaces that do not match the underlying hardware interface can be built (e.g. Message Passing Interface is universally implemented, even on shared memory; Virtual Shared Memory is implemented in software)
Reducing the impact of communication latency (non-local memory latency \lambda) is at the heart of nearly all programming efforts
Direct Memory Access (DMA) requires that there be wires connecting the communicating processors (bus --> serialized, or crossbar --> unrealistic except for very small computers, say n <= 32); navigation through the network is subject to switching delays, collisions, congestion, and so on.
Chapter 3. Reasoning about Performance
thread: a thread of control, logically consisting of program code, a program counter, a call stack, and some thread-specific data including a set of general purpose registers. Threads share access to memory and the file system. --> shared memory parallel programming
process: a thread that also has its own private address space --> message passing parallel programming
latency: the amount of time it takes to complete a given unit of work --> parallelism can hide latency (e.g., as in OS, rather than remain idle, try to switch context and make progress on some other part of the computation; or hardware techniques to actually reduce latency, such as the use of cache and memory prefetching)
throughput: the amount of work that can be completed per unit time --> parallelism can improve throughput (e.g. processor instruction pipeline)
Sources of performance loss:
(1). overhead: setting up and tearing down thread/process, communication, synchronization, computation, memory
(communication overhead:
shared memory: transmission delay, coherency operations, mutual exclusion, contention;
1-sided: transmission delay, mutual exclusion, contention;
message passing: transmission delay, data marshaling, message formation, demarshaling, contention)
(2). non-parallelizable computation:
Amdahl's Law: if 1/S of a computation is inherently sequential, then the maximum performance improvement is limited to a factor of S
Amdahl's law is not a proof that applying large numbers of processors to a problem will have limited success; as the size of the problem grows, the sequential portion may diminish (a worked form of the law appears after this list)
(3). idle processors: often a consequence of synchronization and communication --> consider Load Balance and Memory Bound (bandwidth & latency)
(4). contention for resources: e.g. if the lock is implemented as a spin lock, in which a waiting thread repeatedly requests the lock, the waiting threads increase bus traffic --> harms all threads that access the shared bus (even those that are not contending for the same lock)
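Worked form of Amdahl's law from (2), with illustrative numbers of my own: if the sequential fraction is 1/S, then T_parallel = T_seq * (1/S) + T_seq * (1 - 1/S) / P, so speedup = 1 / (1/S + (1 - 1/S)/P) --> S as P --> infinity. E.g., with 1/S = 0.1 (S = 10) and P = 100, speedup = 1 / (0.1 + 0.9/100) ~= 9.2; no number of processors can push it past 10.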
parallel structure: (a) dependence --> reason about sources of inefficiency; (b) granularity --> match a computation to underlying hardware; (c) locality --> guide to solutions that will naturally have suitable granularity and few dependences
Data Dependence: an ordering on a pair of memory operations that must be preserved to maintain correctness
3 kinds of data dependences:
(1) flow dependence: read after write --> true dependences that cannot be eliminated
(2) anti dependence: write after read --> false dependences can be removed by renaming variables, increasing concurrency at the cost of increasing memory usage
(3) output dependence: write after write
(input dependence: read after read --> no ordering constraints, but can be useful for reasoning about temporal locality)
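(A small illustrative C fragment of my own, not from the book, pinning the three dependences to concrete statements:)
int deps(void) {
    int a, b = 1, c = 2, d;
    a = b + c;    /* S1 */
    d = a * 2;    /* S2: flow (true) dependence on S1 -- reads a after S1 writes it              */
    b = 7;        /* S3: anti dependence on S1 -- writes b after S1 read it; removable by renaming */
    d = d + 1;    /* S4: output dependence with S2 -- both write d (plus a flow dependence on S2)  */
    return a + b + d;
}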
... in effect, when we gave the C specification for adding the numbers, we were specifying more than just which numbers to add; we were implicitly specifying ordering constraints --> must be careful to avoid introducing dependences that do not matter to the computation
Granularity of parallelism: is determined by the frequency of interactions among threads or processes. Here, frequency is measured in the number of instructions between interactions.
e.g. message passing typically works best when phrased as coarse-grained computations, because the overhead of message passing is typically large
Locality (the reason why caches work): temporal locality & spatial locality
in parallel context, locality has the added benefit of minimizing dependences among threads or processes, thereby reducing overhead and contention
In sequential computations, when memory system performance is the bottleneck, programmers are generally encouraged to avoid premature optimization by remembering the 90/10 rule, which states that 90 percent of the time is spent in 10 percent of the code.
--> if improvements needed, identify and rewrite 10 percent of the code that dominates execution time
In parallel computations, it is more complex: Amdahl's law, instruction counts, communication time, waiting time, dependences, dynamic effects (e.g. contention) ...
trade-offs:
communication vs. computation: overlapping communication and computation (that is independent of the communication); redundant computation (recompute a value locally rather than wait for it to be transmitted, which also removes dependences)
memory vs. parallelism: privatization (to break false dependences); padding (to force variables to reside on their own cache line)
overhead vs. parallelism: parallelize overhead (e.g. use tree summation for final accumulation); load balance vs. overhead (over-decomposing the problem into a large number of fine-grained units of work, which are easier to distribute evenly); granularity trade-off (batching, increase the granularity of interaction --> reduce the number of dependences)
functional languages: because they provide referential transparency (the value of an expression does not change over time), an expression's sub-expressions can be evaluated in any order without affecting its result --> functional languages make it trivial to identify substantial amounts of fine-grained parallelism
however, there is considerably more to obtaining performance than simply identifying parallelism: manage data movement and interaction among threads, choose an appropriate granularity of parallelism --> functional languages provide no mechanism for controlling locality (FL abstracts away memory locations, making it difficult for the programmer/compiler to reason about locality), granularity (difficult to know which expressions to combine to create coarser-grained threads), or cross-thread dependences
performance metrics:
execution time (latency), or FLOPS (floating-point operations per second)
speedup = T_sequential / T_parallel (plotted on y-axis with the number of processors on the x-axis)
The speedup curves typically level off as the number of processors increases -- a result of keeping the problem size constant, which causes the amount of work per processor to decrease; costs such as overhead then become more significant, so the total execution time does not improve linearly
superlinear speedup: possible when the parallel program does less work. The most common situations occur (1) when the parallel execution is able to access data that fits in each processor's cache, while the sequential execution must access the slower parts of the memory system because the data does not fit in a single cache (or, on a physically distributed machine, in a single node's memory); (2) when performing a search that is terminated as soon as the desired element is found (the parallel search may explore the space in a different order).
efficiency = speedup / P. An ideal efficiency of 1 indicates linear speedup.
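(Illustrative numbers of my own: T_sequential = 100 s and T_parallel = 20 s on P = 8 processors gives speedup = 5 and efficiency = 5/8 ~= 0.63, i.e. well short of linear speedup.)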
concerns with speedup: speedup is a unitless metric that would seem to factor out technological details such as processor speed, but in fact such details subtly affect speedup:
(1) hardware generations. Because communication latency has not kept pace with improvement of processor performance (thus, the time spent communicating will not diminish as much as the time spent computing), speedup values have generally decreased over time.
(2) sequential time. T_sequential should not be inflated by turning off compiler optimizations (even if they are turned off for both the sequential and the parallel programs), because doing so slows the computation and thus makes the communication latency look relatively better, artificially improving speedup.
(3) relative speedup: uses T_parallel with P=1 as T_sequential. This should not be misreported as true speedup (unless the parallel program on one processor really is faster than any known sequential algorithm).
(4) cold starts. It is good practice to run a computation a few times, measuring only the later runs.
(5) peripheral charges greatly complicate the speedup analysis.
scaled speedup vs. fixed-size speedup: a fixed-size problem is likely to bias toward some particular number of processors
--> scale the problem size with the number of processors. (However, it's not always clear what "twice as large" really means, since the asymptotic complexity of most computations does not grow linearly; the memory and communication requirements do not always grow at the same rate as the computational requirements.)
Scalable performance is difficult to achieve
implications for hardware: why can we afford to use less powerful processors as we increase the number of processors? As the number of processors increases, the marginal benefit of improving each processor's CPU speed is minimal (with slower cores a smaller fraction of the original computation is non-parallelizable, so Amdahl's law suggests better speedup and efficiency from using slower cores)
implications for software: trade-offs become significant, e.g. batching is beneficial for algorithms with a good "surface area to volume ratio"
the corollary of modest potential: the basic algorithm needs to be as scalable as possible, and simply adding more processors does not change this fact -- it exacerbates it. E.g., for an algorithm of complexity O(n^x), solving an m-times larger problem on P processors in the original sequential time requires T = c(mn)^x / P = c n^x --> m = P^(1/x), meaning that to scale a problem whose asymptotic complexity is O(n^2) by a factor of 100, we need 10,000 processors.
increased parallelism by itself does not translate to increased performance (either reduced latency or increased throughput)
Challenge facing many-core:
communication latency grows --> better employ larger-grained parallelism --> many independent processes will suffer because of limited aggregate bandwidth between RAM and chip --> limited by physical dimensions of a many-core chip
Chapter 4. First Steps Toward Parallel Programming
data parallel: perform the same operation to different items of data at the same time; the amount of parallelism grows with the size of the data
task parallel: perform distinct computations or tasks at the same time; the parallelism is not scalable
Peril-L (homophonic with parallel): forall, exclusive, barrier, global and local address space, localize() mapping (global data structures are distributed throughout local memories --> "owner computes rule"), full/empty variable (FE variable), reduce and scan, ...
fixed parallelism, unlimited parallelism, scalable parallelism
a scalable parallelism solution to Count 3s:
int array[length]; // the data is global
int t; // number of desired threads
int total; // results of computation, grand total
forall (j in 0..t-1) {
    int size = mySize(array, 0);          // size of this thread's part of the global data
    int myData[size] = localize(array[]); // associate local variable with my part of the global data
    int priv_count = 0;                   // local accumulation
    for (int i = 0; i < size; ++i) {
        if (myData[i] == 3)
            ++priv_count;
    }
    total = +/ priv_count;                // reduce to compute grand total
}
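(Not in the book: a C/OpenMP sketch of the same scalable structure, where the reduction clause plays the role of Peril-L's +/ reduce and gives each thread a private partial count.)
#include <omp.h>

/* Count 3s: each thread accumulates a private partial count;
   the reduction combines them into the grand total at the end. */
int count3s(const int *array, int length) {
    int total = 0;
    #pragma omp parallel for reduction(+: total)
    for (int i = 0; i < length; ++i) {
        if (array[i] == 3)
            ++total;                 /* updates this thread's private copy of total */
    }
    return total;                    /* private copies combined at the implicit barrier */
}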
Alphabetizing example:
unlimited parallelism --> odd/even interchange sort (https://en.wikipedia.org/wiki/Odd%E2%80%93even_sort);
fixed parallelism --> 26-thread alphabetic batches, localizing, +-reduce, sort in place, +-scan, each item incurs only 2\lambda communication cost (move from global location to local memory, and then back to its final position), require synchronization to perform +-scan to ensure all data have moved out of global array before returning alphabetized batches;
scalable parallelism --> Batcher's Bitonic Sort (https://en.wikipedia.org/wiki/Bitonic_sorter, scalable and parallel version of the sequential mergesort algorithm)
(source code: refer to the book)
Chapter 10. Future Directions in Parallel Programming
Attached processors: special purpose processors can be much more power and space efficient than general purpose microprocessors --> offload as much work (both computation and memory traffic) as possible
GPGPU (general-purpose computing on graphics processing units), Cell processor (with eight attached processors and an embedded PowerPC host), FPGAs (Field-Programmable Gate Arrays)
GPGPU (a modern GPU looks like a massively parallel fine-grained multi-core chip with its own on-chip DRAM): strength: floating point performance, bandwidth
CUDA (Compute Unified Device Architecture): the programming model is not just SIMD
Cell: PowerPC with L1/L2 cache + eight SPEs (synergistic processing elements with no cache) + EIB (element interconnect bus, 200~300 GB/s)
Grid computing: outsource computing, analogy to the power grid
computing grids face many of the same issues that the distributed computing community addresses: resource management, availability, transparency, heterogeneity, scalability, fault tolerance, security, privacy + performance and application scalability (from parallel computing)
Transactional Memory (TM)
database transactions: modify data while ensuring four properties, known as ACID (atomicity, consistency, isolation, durability) --> durability is not needed in a TM system
TM: programmer identifies transactions, and the system enforces the semantics of transactions by tracking their loads and stores to memory and detecting any conflicts that would violate the transactional properties ACI
benefits: scalable, composable, deadlock free, and easy to use (e.g. atomic region)
Comparison with Locks:
(1) locks can lead to deadlock; transactions cannot deadlock: they either commit or abort (livelock, in which a transaction makes no progress because it repeatedly aborts, is still a possibility); see the lock-composition sketch after this list
(2) locks are too strict: they enforce sequential execution even when it is not needed (concurrent reads of a shared memory location, or writes and reads to different memory locations); transactional memory allows these to execute concurrently
(3) locks face a granularity trade-off: coarse-grained locks limit concurrency and thus scalability, while fine-grained locks are difficult to reason about because of the possibility of deadlock; transactional memory allows large atomic sections to execute on multiple threads (e.g. atomic region is different from Java's synchronized statement, because synchronized statement serializes the execution of the entire method)
(4) locks do not compose well; with transactional memory, all dependences are dynamically detected and resolved, so there is no notion of statically ordered lock acquisition (fundamentally locks statically specify an implementation strategy, which is suboptimal because the actual interleaving of threads is not known until runtime)
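(A hedged C/pthreads sketch of my own illustrating the hazard behind (1) and (4): each account is individually thread-safe, but composing two of them forces the caller to pick a lock order, and two transfers taking the locks in opposite orders can deadlock; an atomic { ... } region would state only the desired semantics and leave conflict handling to the TM system.)
#include <pthread.h>

typedef struct {
    pthread_mutex_t m;        /* initialize with PTHREAD_MUTEX_INITIALIZER */
    long balance;
} account_t;

/* transfer(a, b, x) running concurrently with transfer(b, a, y) can deadlock:
   each thread holds one account's lock while waiting for the other's. */
void transfer(account_t *from, account_t *to, long amount) {
    pthread_mutex_lock(&from->m);
    pthread_mutex_lock(&to->m);
    from->balance -= amount;
    to->balance   += amount;
    pthread_mutex_unlock(&to->m);
    pthread_mutex_unlock(&from->m);
}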
Implementation of TM: two basic operations: conflict detection and data versioning
conflict detection: check whether two threads access the same memory location, with at least one of the operations being a modification of that location (pessimistic and optimistic schemes)
data versioning: maintains multiple versions of data so that transactions can be either rolled back or committed (eager and lazy versioning)
Problem Space Promotion (PSP): a parallel solution technique for problems involving combinatorial interactions of array data
PSP reformulates algorithms that operate over d-dimensional data as computations in a higher dimensional problem space. The goal of PSP is to replicate data in one or more dimensions to increase parallelism and to reduce communication and synchronization requirements, that is, to remove dependences.
Chapter 11. Writing Parallel Programs
Preliminary step:
... before starting in earnest with the programming, it is prudent to test that the program's execution can be measured. A key part of working with parallel computations is measuring their performance, so it is critical that you know how to get accurate timings.
When checking the timing facilities, be sure to print some value that is referenced in the timed code. --> ensure that optimizing compilers will not eliminate it
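(A minimal C timing harness of the kind suggested here; the work loop and names are placeholders of my own. The printed value is computed inside the timed region so an optimizing compiler cannot discard the work.)
#include <stdio.h>
#include <time.h>

int main(void) {
    struct timespec t0, t1;
    double sum = 0.0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 1; i <= 10000000; ++i)      /* stand-in for the real computation */
        sum += 1.0 / i;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("sum = %f  elapsed = %f s\n", sum, secs);   /* print a value referenced in the timed code */
    return 0;
}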
Parallel programming recommendations:
* Incremental Development
* Focus on the Parallel Structure (first and then inserting functional components later) --> "top level parallelism"
(Setup and Initialize data structures --> Spawn threads that each iterates over part of the data structure and interacts with other threads at the end of each cycle --> Summarize the data with a reduction --> Print results --> Exit)
* Testing the Parallel Structure (first checking interactions under controlled conditions to verify that the interactions are right)
* Sequential Programming (can be developed and tested independently of the main parallel program)
* Be Willing to Write Extra Code: (1) to eliminate race conditions with hooks that invoke artificial delays or relinquish the processor, (2) to facilitate testing, (3) to print a globally coherent view of distributed data structures, (4) to implement a checkpointing facility that periodically saves the execution state for resuming, (5) to understand performance bottlenecks
* Controlling Parameters during Testing: sufficient data, sufficient interactions, extreme schedules
* Functional Debugging
Standard Benchmarks: e.g. NAS Parallel Benchmarks (NPB) http://www.nas.nasa.gov/publications/npb.html
Performance Analysis --> Experimental Methodology (load imbalance, lock contention, excessive communication ...)
... run the computation up to a barrier, capture the program state immediately after the barrier, and write the data to a file. This approach allows the program to be repeatedly restarted in the post-barrier state.