Fran Allen: Compilers and Parallel Computing Systems

This summary of Fran Allen's talk covers high-performance computing's goal of petaflop machines and how parallel computing systems pursue it through multi-core chips, cross-procedure parallelization, data dependence analysis, and related techniques. It discusses the limitations of the C language and the challenges of parallelization, introduces Vivek Sarkar's compiler research, and stresses the importance of choosing the right high-level language. Parallelism is the key to using resources efficiently, and the right languages and tools will help maximize automatic parallel optimization.

Note: compiled from the web. This article summarizes Fran Allen's talk on parallel computing systems.


The great goal of high-performance computing today is a machine that delivers a petaflop: 10^15 floating-point operations per second. That, of course, calls for on the order of a million gigaflop (10^9 flops) processors. She showed a semilog plot of peak speed against year of introduction; the trend is a straight line (Moore's law is still at work).


Much of Allen's work in the '80s and early '90s was around the PTRAN system of analysis for parallelism. The techniques are used, for example, in the optimization stage of IBM's XL family of compilers.

Because more and more transistors are being placed on chips, they're using more and more energy and getting hotter. Part of the solution, which we're seeing play out, is multi-core chips. This requires parallelism to achieve the performance users expect. But making use of multiple cores requires that tasks be organized, by either users or software, to run in parallel.


By 2021, there will be chips with 1,024 cores on them. Is parallelism the tool that will make all these cores useful?
John Hennessy has called it the biggest challenge Computer Science has ever faced, and he has credentials that might make you believe him. Allen says it's also the best opportunity Computer Science has to improve user productivity, application performance, and system integrity.


For parallel (superscalar, etc.) architectures, compilers (software) have been used to automatically schedule tasks so that they can operate in parallel. Which of those techniques will be useful in this new world of multi-cores?


Allen says we need to get rid of C, and soon. C, as a language, doesn't provide enough information for the compiler to figure out interdependencies, which makes C programs hard to parallelize.
Another way to look at it is that pointers allow programmers to build programs that can't easily be analyzed to find out which parts can be executed at the same time.
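
A minimal sketch of the aliasing problem she is pointing at (my illustration, not an example from the talk): with plain pointers the compiler must assume the two arrays might overlap, so it cannot prove the loop iterations independent; C99's `restrict` qualifier is one way to hand it the missing information.

```c
#include <stddef.h>

/* The compiler must assume dst and src may alias: dst[i] could be
 * the same location as src[i + 1], so iterations can't safely be
 * reordered, vectorized, or run in parallel. */
void scale(double *dst, const double *src, size_t n, double k) {
    for (size_t i = 0; i < n; i++)
        dst[i] = k * src[i];
}

/* 'restrict' promises the arrays don't overlap, which restores the
 * independence information the optimizer needs. */
void scale_restrict(double *restrict dst, const double *restrict src,
                    size_t n, double k) {
    for (size_t i = 0; i < n; i++)
        dst[i] = k * src[i];
}
```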


Another factor that makes parallelization hard is data movement. The latency of data movement inhibits high performance, and Allen offers no silver bullet.
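
A standard illustration of how data movement dominates (my example, not Allen's): the two functions below do identical arithmetic, but the column-order traversal forces far more traffic between memory and the processor, and on a large matrix it runs several times slower.

```c
#include <stddef.h>

#define N 1024

/* Row-order traversal: consecutive accesses fall in the same cache
 * line, so each line fetched from memory is fully used. */
double sum_rows(const double a[N][N]) {
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-order traversal of the same data: each access touches a
 * different cache line, so much more data moves for the same work. */
double sum_cols(const double a[N][N]) {
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += a[i][j];
    return s;
}
```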


The key is the right high-level language, one that can effectively take advantage of the many good scheduling and pipelining algorithms that exist. If we don't start with the right high-level language, those techniques will have limited impact.

She presents some research from Vivek Sarkar on compiling for parallelism. Only a small fraction of application developers are experts in parallelism, and expecting them to become experts is unreasonable. The software is too complex, and it is the primary bottleneck in the usage of parallel systems. X10 is an example of an (object-oriented) language that tries to maximize the amount of automatic parallel optimization that can be done.



Major themes include cross-procedure parallelization, data dependence analysis, control dependence analysis, and then using those analyses to satisfy the dependences while maximizing parallelism.
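
As a small illustration of what dependence analysis decides (my example, not one from the talk): the first loop below carries a dependence from each iteration to the next and must run in order, while the second has no loop-carried dependence and can be parallelized.

```c
#include <stddef.h>

/* Loop-carried dependence: iteration i reads a[i - 1], which the
 * previous iteration wrote, so the iterations are ordered. */
void prefix_sums(double *a, size_t n) {
    for (size_t i = 1; i < n; i++)
        a[i] += a[i - 1];
}

/* No loop-carried dependence: each iteration touches only a[i],
 * so dependence analysis can prove the iterations independent. */
void square_all(double *a, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] *= a[i];
}
```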

Useful parallelism depends on the run-time behavior of the program (i.e., loop frequency, branch prediction, and node run times) and on the parameters of the target multiprocessor. Finding the maximal parallelism isn't enough, because it probably can't be mapped efficiently onto the multiple cores or processors. There is a trade-off between the partitioning cost and the run time; finding the intersection gives the right level of parallelism, the level that makes the most efficient use of available resources. Interprocedural analysis is the key to whole-program parallelism.


One of the PTRAN analysis techniques was to transform the program into a functional equivalent that used static single assignment (SSA).
This, of course, is what functional programming enthusiasts have been saying for years: one of functional programming's biggest advantages is that functional programs, those without mutation, are much more easily parallelized than imperative programs (including imperative-based object-oriented languages).
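
A minimal sketch of the idea (my example, not PTRAN's actual representation): in static single assignment form every variable is assigned exactly once, so each use names exactly one definition and the dependence structure of the code becomes explicit.

```c
/* Imperative original: x is mutated, so which value a given use of
 * x refers to depends on position in the code. */
int before(int a, int b) {
    int x = a + b;
    x = x * 2;          /* overwrites the first x */
    return x + a;
}

/* After SSA-style renaming: each name is assigned once, so every
 * use points at a single definition, as in a functional program. */
int after(int a, int b) {
    int x1 = a + b;
    int x2 = x1 * 2;
    return x2 + a;
}
```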


There's a long list of transformations that can be done: everything from array padding (to get easily handled dimensions) to loop unrolling and interleaving. Doing most of these transformations well requires detailed knowledge of the machine, which makes it a better job for compilers than for humans. Even then, the speedup is less than the number of additional processors applied to the job; applying 4 processors doesn't get you a speedup of 4, but more like 2.2. The speedup, at present, is asymptotic.
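
As a hedged sketch of one item on that list (my example, not the talk's): loop unrolling with independent partial sums amortizes loop overhead and exposes instruction-level parallelism, and the right unroll factor depends on machine details such as register count and pipeline depth, which is why compilers do it better than humans.

```c
#include <stddef.h>

/* Baseline: one add plus loop overhead per element, and every add
 * depends on the previous one through s. */
double sum(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled by 4 with independent accumulators, letting the adds
 * overlap in the pipeline. (Note this reassociates floating-point
 * addition, which compilers only do under relaxed FP rules.) */
double sum_unrolled(const double *a, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    double s = s0 + s1 + s2 + s3;
    for (; i < n; i++)  /* remainder elements */
        s += a[i];
    return s;
}
```

The sublinear speedup Allen cites is also what Amdahl's law predicts: if roughly 75% of a job parallelizes, four processors give a speedup of 1 / (0.25 + 0.75/4) ≈ 2.3, close to her 2.2 figure.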


(The End)
