Fran Allen: Compilers and Parallel Computing Systems

This summary of Fran Allen's talk covers high-performance computing's goal of petaflop machines and how parallel computing systems pursue it through multi-core chips, cross-procedure parallelization, data dependence analysis, and related techniques. It discusses the limitations of the C language and the challenges of parallelization, introduces Vivek Sarkar's compiler research, and stresses the importance of choosing the right high-level language. Parallelism is the key to using resources efficiently, and the right languages and tools will help maximize automatic parallel optimization.

Note: compiled from the web. This article summarizes Fran Allen's talk on parallel computing systems.


The great goal of high-performance computing today is a machine that delivers a petaflop: 10^15 floating-point operations per second. That, of course, calls for on the order of a million gigaflop (10^9 flops) processors. She showed a semilog plot of peak speed against year of introduction; the trend is a straight line (Moore's law is still at work).


Much of Allen's work in the '80s and early '90s was around the PTRAN system of analysis for parallelism. The techniques are used, for example, in the optimization stage of IBM's XL family of compilers.

Because more and more transistors are being placed on chips, they're using more and more energy and getting hotter. Part of the solution, which we're seeing play out, is multi-core chips. This requires parallelism to achieve the performance users expect. But making use of multiple cores requires that tasks be organized, by either users or software, to run in parallel.


By 2021, there will be chips with 1,024 cores on them. Is parallelism the tool that will make all these cores useful?
John Hennessy has called it the biggest challenge Computer Science has ever faced, and he has credentials that might make you believe him. Allen says it's also the best opportunity Computer Science has to improve user productivity, application performance, and system integrity.


For parallel (superscalar, etc.) architectures, compilers (software) have been used to automatically schedule tasks so that they can operate in parallel. Which of those techniques will be useful in this new world of multi-cores?


Allen says we need to get rid of C, and soon. C, as a language, doesn't provide enough information for the compiler to figure out interdependencies, which makes C programs hard to parallelize.
Another way to look at it is that pointers allow programmers to build programs that can't easily be analyzed to find out which parts can be executed at the same time.
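
A minimal sketch of the aliasing problem she is pointing at (my illustration, not an example from the talk): with plain pointers the compiler must assume the two arrays might overlap, so it cannot prove the loop iterations independent; C99's `restrict` qualifier is one way to hand it the missing information.

```c
#include <stddef.h>

/* The compiler must assume dst and src may alias: dst[i] could be
 * the same location as src[i + 1], so iterations can't safely be
 * reordered, vectorized, or run in parallel. */
void scale(double *dst, const double *src, size_t n, double k) {
    for (size_t i = 0; i < n; i++)
        dst[i] = k * src[i];
}

/* 'restrict' promises the arrays don't overlap, which restores the
 * independence information the optimizer needs. */
void scale_restrict(double *restrict dst, const double *restrict src,
                    size_t n, double k) {
    for (size_t i = 0; i < n; i++)
        dst[i] = k * src[i];
}
```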


Another factor that makes parallelization hard is data movement. The latency of data movement inhibits high performance, and Allen offers no silver bullet.
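
A standard illustration of how data movement dominates (my example, not Allen's): the two functions below do identical arithmetic, but the column-order traversal forces far more traffic between memory and the processor, and on a large matrix it runs several times slower.

```c
#include <stddef.h>

#define N 1024

/* Row-order traversal: consecutive accesses fall in the same cache
 * line, so each line fetched from memory is fully used. */
double sum_rows(const double a[N][N]) {
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-order traversal of the same data: each access touches a
 * different cache line, so much more data moves for the same work. */
double sum_cols(const double a[N][N]) {
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += a[i][j];
    return s;
}
```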


The key is the right high-level language, one that can effectively take advantage of the many good scheduling and pipelining algorithms that exist. If we don't start with the right high-level language, those techniques will have limited impact.

She presents some research from Vivek Sarkar on compiling for parallelism. Only a small fraction of application developers are experts in parallelism, and expecting them to become experts is unreasonable. The software is too complex, and it is the primary bottleneck in the usage of parallel systems. X10 is an example of an (object-oriented) language that tries to maximize the amount of automatic parallel optimization that can be done.



Major themes include cross-procedure parallelization, data dependence analysis, control dependence analysis, and then using those analyses to satisfy the dependences while maximizing parallelism.
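
As a small illustration of what dependence analysis decides (my example, not one from the talk): the first loop below carries a dependence from each iteration to the next and must run in order, while the second has no loop-carried dependence and can be parallelized.

```c
#include <stddef.h>

/* Loop-carried dependence: iteration i reads a[i - 1], which the
 * previous iteration wrote, so the iterations are ordered. */
void prefix_sums(double *a, size_t n) {
    for (size_t i = 1; i < n; i++)
        a[i] += a[i - 1];
}

/* No loop-carried dependence: each iteration touches only a[i],
 * so dependence analysis can prove the iterations independent. */
void square_all(double *a, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] *= a[i];
}
```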

Useful parallelism depends on the run-time behavior of the program (i.e., loop frequency, branch prediction, and node run times) and on the parameters of the target multiprocessor. Finding the maximal parallelism isn't enough, because it probably can't be mapped efficiently onto the multiple cores or processors. There is a trade-off between the partitioning cost and the run time; finding the intersection gives the right level of parallelism, the level that makes the most efficient use of available resources. Interprocedural analysis is the key to whole-program parallelism.


One of the PTRAN analysis techniques was to transform the program into a functional equivalent that used static single assignment (SSA).
This, of course, is what functional programming enthusiasts have been saying for years: one of functional programming's biggest advantages is that functional programs, those without mutation, are much more easily parallelized than imperative programs (including imperative-based object-oriented languages).
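
A minimal sketch of the idea (my example, not PTRAN's actual representation): in static single assignment form every variable is assigned exactly once, so each use names exactly one definition and the dependence structure of the code becomes explicit.

```c
/* Imperative original: x is mutated, so which value a given use of
 * x refers to depends on position in the code. */
int before(int a, int b) {
    int x = a + b;
    x = x * 2;          /* overwrites the first x */
    return x + a;
}

/* After SSA-style renaming: each name is assigned once, so every
 * use points at a single definition, as in a functional program. */
int after(int a, int b) {
    int x1 = a + b;
    int x2 = x1 * 2;
    return x2 + a;
}
```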


There's a long list of transformations that can be done: everything from array padding (to get easily handled dimensions) to loop unrolling and interleaving. Doing most of these transformations well requires detailed knowledge of the machine, which makes it a better job for compilers than for humans. Even then, the speedup is less than the number of additional processors applied to the job; applying 4 processors doesn't get you a speedup of 4, but more like 2.2. The speedup, at present, is asymptotic.
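
As a hedged sketch of one item on that list (my example, not the talk's): loop unrolling with independent partial sums amortizes loop overhead and exposes instruction-level parallelism, and the right unroll factor depends on machine details such as register count and pipeline depth, which is why compilers do it better than humans.

```c
#include <stddef.h>

/* Baseline: one add plus loop overhead per element, and every add
 * depends on the previous one through s. */
double sum(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled by 4 with independent accumulators, letting the adds
 * overlap in the pipeline. (Note this reassociates floating-point
 * addition, which compilers only do under relaxed FP rules.) */
double sum_unrolled(const double *a, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    double s = s0 + s1 + s2 + s3;
    for (; i < n; i++)  /* remainder elements */
        s += a[i];
    return s;
}
```

The sublinear speedup Allen cites is also what Amdahl's law predicts: if roughly 75% of a job parallelizes, four processors give a speedup of 1 / (0.25 + 0.75/4) ≈ 2.3, close to her 2.2 figure.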


(The End)
