Parallel Programming with Hints (reposted)

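This article takes an in-depth look at automatic parallelization and how it is implemented in Haskell and in .NET. It uses examples to show how parallelization can improve program performance in each environment, and compares the characteristics of the two approaches to parallel programming.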


Wouldn’t it be nice to be able to write sequential programs and let the compiler or the runtime automatically find opportunities for parallel execution? Something like this is already being done on a micro scale inside processors. As much as possible, they try to execute individual instructions in parallel. Of course they have to figure out data dependencies and occasionally stall the pipeline or idle while waiting for a memory fetch. More sophisticated processors are even able to speculate–they guess a value that hasn’t been calculated or fetched yet and run speculative execution. When the value is finally available, they compare it with their guess and either discard or commit the execution.

If processors can do it, why can’t a language runtime do the same on a larger scale? It would solve the problem of effectively using all those cores that keep multiplying like rabbits on a chip.

The truth is, we haven’t figured out yet how to do it. Automatic parallelization is, in general, too complex. But if the programmer is willing to provide just enough hints to the compiler, the runtime might figure things out. Such a programming model is called semi-implicit parallelism and has been implemented in two very different environments, Haskell and .NET. The two relevant papers I’m going to discuss are Runtime Support for Multicore Haskell and The Design of a Task Parallel Library.

In both cases the idea is to tell the compiler that certain calculations may be done in parallel. It doesn’t necessarily mean that the code will be executed in multiple threads–the runtime makes this decision depending on the number of cores and their availability. The important thing is that, other than providing those hints, the programmer doesn’t have to deal with threads or, at least in Haskell, with synchronization. I will start with Haskell but, if you’re not into functional programming, you may skip to .NET and the Task Parallel Library (and hopefully come back to Haskell later).

In Multicore Haskell

In Haskell, you hint at parallel execution using the par combinator, which you insert between two expressions, like this: e1 `par` e2. The runtime then creates a spark for the left hand side expression (here, e1). A spark is a deferred calculation that may be executed in parallel (in the .NET implementation a spark is called a task). Notice that, in Haskell, which is a lazy programming language, all calculations are, by design, deferred until their results are needed; at which point their evaluation is forced. The same mechanism kicks in when the result of a spark is needed–and it hasn’t been calculated in parallel yet. In such a case the spark is immediately evaluated in the current thread (thus forfeiting the chance for parallel execution). The hope is though that enough sparks will be ready before their results are needed, leading to an overall speedup.

To further control when the sparks are evaluated (whether in parallel or not), Haskell provides another combinator, pseq, which enforces sequencing. You insert it between two expressions, e1 `pseq` e2, to make sure that the left hand side, e1, is evaluated before the evaluation of e2 is started.
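As a minimal sketch of the two combinators working together, the following function sparks one expression, evaluates a second one in the current thread, and then combines them. It assumes GHC, where par and pseq can be imported from GHC.Conc in the base library (Control.Parallel in the parallel package exports them as well). The names fib and parSum are illustrative and do not come from the paper.

```haskell
import GHC.Conc (par, pseq)

-- Deliberately naive Fibonacci, used only to give the sparks some work.
fib :: Int -> Integer
fib n
  | n < 2     = toInteger n
  | otherwise = fib (n - 1) + fib (n - 2)

-- Spark `a`, evaluate `b` in the current thread, then add the two.
-- If no other core picked up the spark, `a` is simply evaluated here
-- when the addition demands it.
parSum :: Int -> Int -> Integer
parSum m n = a `par` (b `pseq` (a + b))
  where
    a = fib m
    b = fib n
```

The result is always the same as fib m + fib n. Whether the spark actually runs on another core is up to the runtime; with GHC this additionally requires compiling with -threaded and running with +RTS -N.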

I’ll show you how to parallelize the standard map function (in C++ it would be called std::transform). Map applies a function, passed to it as the first argument, to each element of a list, which is passed to it as the second argument. As a Haskell refresher, let me walk you through the implementation of map.

map f []     = []
map f (x:xs) = y:ys
    where y  = f x
          ys = map f xs

Map is implemented recursively, so its definition is split into the base case and the recursive case. The base case just states that map applied to an empty list, [], returns an empty list (it ignores the first argument, f).

If, on the other hand, the list is non-empty, it can be split into its head and tail. This is done through pattern matching–the pattern being (x:xs), where x matches the head element and xs the (possibly empty) tail of the list.

In that case, map is defined to return a new list, (y:ys) whose head is y and tail is ys. The where clause defines those two: y is the result of the application of the function f to x, and ys is the result of the recursive application of map to the tail of the list, xs.

The parallel version does the same (it is semantically equivalent to the sequential version), but it gives the runtime the opportunity to perform function applications in parallel. It also waits for the evaluation of the tail to finish before returning.

parMap f []     = []
parMap f (x:xs) = y `par` (ys `pseq` y:ys)
    where y  = f x
          ys = parMap f xs

The important changes are: y, the new head, may be evaluated in parallel with the tail (the use of the par combinator). The result, y:ys, is returned only when the tail part, ys, has been evaluated (the use of the pseq combinator).

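The tail calculation is also split into parallel computations through recursive calls to parMap. The net result is that all applications of f to elements of the list are potentially done in parallel. Because pseq evaluates its left-hand side only as far as the outermost list constructor, what is guaranteed by the time parMap returns is that the whole list has been traversed and a spark has been created for every element. An element that no spark has evaluated yet is computed when its value is first demanded.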

It’s instructive to walk through the execution of parMap step-by-step. For simplicity, let’s perform parMap on a two-element list, [a, b].

First we pattern-match this list to x = a and xs = [b]. We create the first spark for the evaluation of (y = f a) and then proceed with the evaluation of the right hand side of par, (ys `pseq` y:ys). Here ys = parMap f [b].

Because of the `pseq`, we must evaluate ys next. To do that, we call (parMap f [b]). Now the list [b] is split into the head, b, and the empty tail, []. We create a spark to evaluate y' = f b and proceed with the right-hand side, (ys' `pseq` y':ys').

Again, the `pseq` waits for the evaluation of ys' = parMap f []. But this one is easy: we apply the base definition of parMap, which returns an empty list.

Now we are ready to retrace our steps. The right hand side of the last `pseq` re-forms the list y':[]. But that’s the ys the previous `pseq` was waiting for. It can now proceed, producing y:(y':[]), which is the same as [y, y'] or [f a, f b], which is what we were expecting.
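To replay this walkthrough on a real compiler, the definition can be placed in a module of its own. The sketch below assumes GHC and imports par and pseq from GHC.Conc in base; the type signature and the squares example are additions for illustration. (The parallel package ships its own parMap in Control.Parallel.Strategies, which takes an extra strategy argument, so the two should not be confused.)

```haskell
import GHC.Conc (par, pseq)

-- parMap as defined in the text, with an explicit type signature.
parMap :: (a -> b) -> [a] -> [b]
parMap _ []     = []
parMap f (x:xs) = y `par` (ys `pseq` (y : ys))
  where
    y  = f x           -- sparked: may be evaluated on another core
    ys = parMap f xs   -- traversed before the result is returned

-- Example use: one spark per list element.
squares :: [Integer]
squares = parMap (\n -> n * n) [1 .. 10]
```

Since parMap is semantically equivalent to map, squares is the same list that map (\n -> n * n) [1 .. 10] would produce; only the evaluation schedule may differ.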

Notice the complete absence of explicit synchronization in this code. This is due to the functional nature of Haskell. There’s no shared mutable state, so no locking or atomic operations are needed. (More explicit concurrent models are also available in Haskell, using MVars or transactional memory.)
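For contrast, here is a small sketch of the more explicit MVar style mentioned in the parenthesis. It uses forkIO and the MVar operations from the base library; the function name sumInBackground is hypothetical. Unlike par, this code creates a thread explicitly and synchronizes on the MVar by hand.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)

-- Compute a sum in a separate (lightweight) thread and wait for it.
sumInBackground :: [Int] -> IO Int
sumInBackground xs = do
  box <- newEmptyMVar
  -- ($!) forces the sum inside the worker thread, so the worker
  -- hands over a finished value rather than a deferred calculation.
  _ <- forkIO (putMVar box $! sum xs)
  -- takeMVar blocks until the worker has filled the box.
  takeMVar box
```

The MVar acts as a one-element mailbox: the explicit takeMVar plays the role that forcing a spark's result plays in the par version.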

Task Parallel Library in .NET

It’s no coincidence that many ideas from Haskell end up in Microsoft languages. Many Haskell programmers work for Microsoft Research, including the ultimate guru, Simon Peyton Jones. The Microsoft Task Parallel Library (TPL) translates the ideas from Multicore Haskell to .NET. One of its authors, Daan Leijen, is a Haskell programmer who, at some point, collaborated with Simon Peyton Jones. Of course, a .NET language like C# presents a different set of obstacles to parallel programming. It operates on mutable state which needs protection from concurrent access. This protection (which, incidentally, is the hardest part of multithreaded programming) is left to the programmer.

Here’s an example of an algorithm in C# with hidden opportunities for parallel implementation. MatrixMult multiplies two matrices. It iterates over the rows and columns of the result matrix. The value that goes at their intersection is calculated by the innermost loop.

void MatrixMult(int size, double[,] m1, double[,] m2, double[,] result)
{
   for(int i = 0; i < size; i++){
      // calculate the i'th row of the result (the first index is i)
      for(int j = 0; j < size; j++){
         result[i, j] = 0;
         for(int k = 0; k < size; k++){
              result[i, j] += m1[i, k] * m2[k, j];
         }
      }
   }
}

Each iteration of the outer loop fills one row of the result (all elements result[i, j] for a fixed i), so the rows could potentially be evaluated in parallel. The problem is, the size of the array and the number of processor cores might be unknown until the program is run. Creating a large number of threads when there are only a few cores may lead to a considerable slowdown, which is the opposite of what we want. So the goal of TPL is to let the programmer express the potential for parallel execution but leave it to the runtime to create an optimal number of threads.

The programmer splits the calculation into tasks (the equivalent of Haskell sparks) by making appropriate library calls; and the runtime maps those tasks into OS threads–many-to-one, if necessary.

Here’s how the same function looks with parallelization hooks.

void ParMatrixMult(int size, double[,] m1, double[,] m2, double[,] result)
{
   Parallel.For(0, size, delegate(int i)
   {
      for(int j = 0; j < size; j++){
         result[i, j] = 0;
         for(int k = 0; k < size; k++){
              result[i, j] += m1[i, k] * m2[k, j];
         }
      }
   });
}

Because of clever formatting, this version looks very similar to the original. The outer loop is replaced by the call to Parallel.For, which is one of the parallelizing TPL functions. The inner loops are packed into a delegate.

This delegate is assigned to a task (the analog of a Haskell spark) that is potentially run in a separate thread. Here the delegate is actually a closure: it captures the local variables size, m1, m2 and result. The latter is actually modified inside the delegate. This is how shared mutable state sneaks into potentially multi-threaded execution. Luckily, in this case, such sharing doesn’t cause races. Consider however what would happen if we changed the types of the matrices from double[,] to char[,]. On hardware that cannot store such narrow elements individually, parallel updates to neighboring array elements may inadvertently encroach on each other and lead to incorrect results. Programmer beware! (This is not a problem in Haskell because of the absence of mutable state.)
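For comparison, here is a sketch of the same row-by-row multiplication written in the Haskell style of the earlier section, where no result array is shared and mutated. It assumes GHC (par and pseq from GHC.Conc) and represents a matrix as a list of rows; the names Matrix, forceAll, mulRow, and parMatMul are illustrative and are not part of TPL or of either paper.

```haskell
import Data.List (transpose)
import GHC.Conc (par, pseq)

type Matrix = [[Double]]

-- parMap as in the Haskell section: one spark per list element.
parMap :: (a -> b) -> [a] -> [b]
parMap _ []     = []
parMap f (x:xs) = y `par` (ys `pseq` (y : ys))
  where
    y  = f x
    ys = parMap f xs

-- Return the list only after every element has been evaluated, so that
-- a spark for a row really computes the whole row, not just its first cell.
forceAll :: [Double] -> [Double]
forceAll xs = foldr seq () xs `seq` xs

-- One row of the product: dot products of `row` with each column of m2.
mulRow :: [[Double]] -> [Double] -> [Double]
mulRow cols row = forceAll [sum (zipWith (*) row col) | col <- cols]

-- Each row of the result is offered to the runtime as a separate spark.
parMatMul :: Matrix -> Matrix -> Matrix
parMatMul m1 m2 = parMap (mulRow (transpose m2)) m1
```

Every row is a fresh immutable value, so there is nothing for neighboring updates to encroach on. The price is that the rows are allocated rather than written into a preexisting array.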

But even if the programmer is aware of potential sharing and protects shared variables with locks, it’s not the end of the story. Consider this example:

int sum = 0;
Parallel.For(0, 10000, delegate(int i)
{
   if(isPrime(i)){
      lock(this) { sum += i; }
   }
});

The captured variable, sum, is protected by the lock, so data races are avoided. This lock, however, becomes a performance bottleneck: it is taken for every prime number in the range.

Now consider the fact that, on a 4-core machine, we’ll be running 10000 tasks distributed between about 4 threads. It would be much more efficient to accumulate the sum in four local variables–no locking necessary–and add them together only at the end of the calculation. This recipe can be expressed abstractly as a map/reduce type of algorithm (a generalization of the C++ std::accumulate). The tasks are mapped into separate threads, which work in parallel, and the results are then reduced into the final answer.
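The recipe of keeping one local accumulator per worker and adding the partial results at the end can be sketched in Haskell as follows. This is an illustration of the idea, not of how TPL schedules its tasks: the number of chunks is passed in by hand, and the names isPrime, sumPrimes, and chunkedPrimeSum are hypothetical. It assumes GHC, with par and pseq imported from GHC.Conc.

```haskell
import Data.List (foldl')
import GHC.Conc (par, pseq)

-- Trial division; adequate for small ranges.
isPrime :: Int -> Bool
isPrime n
  | n < 2     = False
  | otherwise = all (\d -> n `mod` d /= 0) (takeWhile (\d -> d * d <= n) [2 ..])

-- Sum of the primes in the half-open range [lo, hi), accumulated strictly.
-- This is the "local variable" of one chunk; no lock is involved.
sumPrimes :: Int -> Int -> Int
sumPrimes lo hi = foldl' (+) 0 (filter isPrime [lo .. hi - 1])

-- Split [0, n) into `parts` chunks (assumed positive), spark one partial
-- sum per chunk, and add the partial sums together only at the end.
chunkedPrimeSum :: Int -> Int -> Int
chunkedPrimeSum parts n = combine partials
  where
    step     = (n + parts - 1) `div` parts
    partials = [sumPrimes lo (min n (lo + step)) | lo <- [0, step .. n - 1]]
    combine []       = 0
    combine (p : ps) = p `par` (rest `pseq` (p + rest))
      where rest = combine ps
```

For example, chunkedPrimeSum 4 10000 computes four partial sums, mirroring the four local variables described above, and performs only three additions to combine them.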

Here’s how map/reduce is expressed in TPL:

int sum = Parallel.Aggregate(
  0, 10000, // domain
  0, // initial value
  delegate(int i){ return (isPrime(i) ? i : 0); },
  delegate(int x, int y){ return x+y; }
);

The first delegate, which is run by 10000 tasks, does not modify any shared state–it just returns its result, which is internally accumulated in some hidden local variable. The second delegate–the “reduce” part of the algorithm–is called when there’s a need to combine results from two different tasks.
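The division of labor between the two delegates can be mimicked in Haskell with a function that takes the same five ingredients: a range, an initial value, a per-index function, and a combining function. The sketch below is not the TPL implementation. The half-open range, the grain constant, the recursive splitting, and the requirement that the initial value be an identity for the combining function are all assumptions made for this illustration, as are the names parAggregate and primeSum.

```haskell
import Data.List (foldl')
import GHC.Conc (par, pseq)

isPrime :: Int -> Bool
isPrime n
  | n < 2     = False
  | otherwise = all (\d -> n `mod` d /= 0) (takeWhile (\d -> d * d <= n) [2 ..])

-- Ranges at or below this size are folded sequentially (assumed value).
grain :: Int
grain = 1000

-- Aggregate f over [lo, hi). Small ranges are folded locally; larger ones
-- are split in two, the left half is sparked, and the halves are combined.
-- `z` is used once per leaf, so it must be an identity for `combine`.
parAggregate :: Int -> Int -> b -> (Int -> b) -> (b -> b -> b) -> b
parAggregate lo hi z f combine
  | hi - lo <= grain = foldl' (\acc i -> combine acc (f i)) z [lo .. hi - 1]
  | otherwise        = left `par` (right `pseq` combine left right)
  where
    mid   = lo + (hi - lo) `div` 2
    left  = parAggregate lo mid z f combine
    right = parAggregate mid hi z f combine

-- The prime-sum example from the text, expressed with parAggregate.
primeSum :: Int
primeSum = parAggregate 0 10000 0 (\i -> if isPrime i then i else 0) (+)
```

As in the C# version, the per-index function touches no shared state, and the combining function is the only place where results from different pieces of work meet.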

The Place for Functional Programming

Notice that the last example was written in very functional style. In particular you don’t see any mutable state. The delegates are pure functions. This is no coincidence: functional programming has many advantages in parallel programming.

I’ve been doing a lot of multi-threaded programming in C++ lately and I noticed how my style is gradually shifting from object-oriented towards functional. This process is accelerating as functional features keep seeping into the C++ standard. Obviously, lambdas are very useful, but so are the move semantics made possible by rvalue references, especially in passing data between threads. It’s becoming more and more obvious that, in order to be a good C++ programmer, one needs to study other languages. I recommend Haskell and Scala in particular. I’ll be blogging about them in the future.

Bibliography

  1. Simon Marlow, Simon Peyton Jones, and Satnam Singh, Runtime Support for Multicore Haskell
  2. Daan Leijen, Wolfram Schulte, and Sebastian Burckhardt, The Design of a Task Parallel Library
  3. A video introduction to Haskell by Simon Peyton Jones: Part I, Part II, together with the slides.