Cray Reminiscences

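This article looks back at the optimization techniques used on Cray supercomputers and considers how they bear on the performance of modern Java applications. It stresses the importance of understanding hardware characteristics and gives concrete examples of how adjusting coding style can improve program efficiency.
Kirk Pepperdine's attendance at AMD's performance talk at JavaOne produced a cascade of fascinating memories about Cray optimizations. Here, Kirk relates some of the most interesting optimizations that helped make Crays superfast, and how they relate to your Java programs.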
Published July 2007, Author Kirk Pepperdine

 

Traditionally, JavaOne has offered more performance-related talks than any other Java conference. This year was no exception, with so many performance-related talks that it was impossible to attend all of them. One of the more interesting sessions was put on by AMD's Azeem Jiva. The timeless theme of the talk was: make sure your programs are good to your hardware and your hardware will be good to you. I say timeless because as I watched Azeem stroll through the demos, my mind was deluged with memories of my days programming on Cray supercomputers.

The Cray series of supercomputers was an engineering marvel in its prime. The brilliance of the machine architecture wasn't only about speed. The scalar processors were not much faster than what would be found on any other server. Cray's brilliance was about the balance within the machine. As they saw it, there is no point in having a superfast CPU if it is only going to be starved for work. So, much of the extreme engineering that took place went into making sure that the CPUs were never hung on wait conditions.

One of my long-time recommendations for Windows users, to eliminate virtual memory from their machines (don't do this unless you've got plenty of real RAM), is based on the lack of virtual memory on Cray systems. In a time when memory was both in short supply and expensive, Cray recognized that getting data from a disk created huge wait conditions. So they eliminated virtual memory. To help with the I/O, they introduced solid state memory devices and multiple separate channels to move data from one place to another.

Most of these optimizations were performed under the hood and, aside from a few rules of thumb such as "don't do I/O and processing in the same loop", one's coding style had little effect on them. That said, there were other optimizations that could be obliterated if the developer ignored or didn't understand how the underlying hardware was architected and functioned. Out of the many optimizations that a developer's coding style could directly affect, I'd like to mention three: instruction buffer faults, striding through memory, and the ability to utilize the vector processors.

Though rare at the time, some form of the technologies found in Cray's vector processors is now commonplace in modern processors. For example, pipelining intermediate results through various stages of computation, so that the processor can work on multiple pieces of data at the same time, is quite common. Things like path prediction are much more advanced now than they were when I programmed Crays. Back in the late 80s and early 90s, it was fairly easy (and it still is) to obfuscate what you may want to do next. In the worst case, the Cray would run your code in scalar mode instead of utilizing the much faster and more efficient vector processors. The most common way to obfuscate was to put branch statements in a for loop (vector processors worked best with large for loops). In order to get code to vectorize, one would often separate the data based on the condition in the branch prior to entering the processing loop. Each dataset would then be run through its own separate loop with the branch removed.
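The data-splitting trick described above can be sketched in Java. This is only an illustration under stated assumptions: the class name BranchSplit, its method names, and the "double it or negate it" work are hypothetical stand-ins, and whether any particular JIT compiler vectorizes either form is something to measure rather than assume.

```java
class BranchSplit {
    // Branch inside the processing loop: each iteration may take a different path.
    static long processBranchy(int[] data) {
        long total = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] >= 0) {
                total += 2L * data[i];      // hypothetical work for non-negative values
            } else {
                total += -(long) data[i];   // hypothetical work for negative values
            }
        }
        return total;
    }

    // Split the data on the condition first, then run each set through
    // its own loop with the branch removed.
    static long processSplit(int[] data) {
        int[] nonNegative = new int[data.length];
        int[] negative = new int[data.length];
        int p = 0;
        int n = 0;
        for (int v : data) {                // one-off partitioning pass (still branches)
            if (v >= 0) {
                nonNegative[p++] = v;
            } else {
                negative[n++] = v;
            }
        }
        long total = 0;
        for (int i = 0; i < p; i++) {       // branch-free loop over the first set
            total += 2L * nonNegative[i];
        }
        for (int i = 0; i < n; i++) {       // branch-free loop over the second set
            total -= negative[i];
        }
        return total;
    }
}
```

Both methods compute the same total. The partitioning pass costs something, so a split like this would only be expected to pay off when the per-element work is much heavier than in this toy.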

Cray's instruction buffer was big enough to hold 40 instructions. The system would load the next 40 instructions to be executed and when they were exhausted it would load the next 40. It did have the ability to do a predictive pre-fetch but in general, fetching the next set of instructions would most likely be a hold condition (CPU goes hungry). This is yet another case where a developer's coding style could have adverse effects on performance. Of course code that randomly jumped to instructions not in the buffer would have the biggest impact on performance, but there were more subtle conditions than that. Again loops become important. Loops that were larger than 40 instructions and those that spilled over an instruction buffer boundary would result in some (sometimes significant) performance degradation. The obvious solution for the former problem was to write very small tight loops even if that meant looping twice over the same dataset. Crays were very well tuned for doing this, so quite often several single passes worked much better than a single "do all" pass over the dataset.

In retrospect, the latter problem should have been handled automatically by having the optimizer align loops on instruction buffer boundaries. Cray's solution at the time was to introduce a pragma statement. The pragma told the compiler/linker to align the code following the statement on an instruction buffer boundary. The programmer's role in all of this, other than recognizing where to put the pragmas, was to ensure that loops did not span more than 40 instructions. Done right, a couple of short loops will outperform a single loop that does everything.
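To make the "several short loops versus one do-all loop" idea concrete, here is a minimal Java sketch. The class LoopFission and the three pieces of work (scale, sum, max) are invented for illustration. On a modern JVM the single pass may well be the faster form because it touches the data once, so treat this as a picture of the style rather than a recommendation.

```java
class LoopFission {
    // One "do all" pass: three unrelated pieces of work share a single loop body.
    static long[] singlePass(int[] data, int[] scaled, int factor) {
        long sum = 0;
        long max = Long.MIN_VALUE;
        for (int i = 0; i < data.length; i++) {
            scaled[i] = data[i] * factor;
            sum += data[i];
            if (data[i] > max) {
                max = data[i];
            }
        }
        return new long[] {sum, max};
    }

    // The same work as three small, tight loops over the same dataset.
    static long[] severalPasses(int[] data, int[] scaled, int factor) {
        for (int i = 0; i < data.length; i++) {
            scaled[i] = data[i] * factor;
        }
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i];
        }
        long max = Long.MIN_VALUE;
        for (int i = 0; i < data.length; i++) {
            if (data[i] > max) {
                max = data[i];
            }
        }
        return new long[] {sum, max};
    }
}
```

Both methods fill `scaled` identically and return the same `{sum, max}` pair, so either form can be swapped in while measuring.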

The most interesting optimization was the one that supported feeding the vector processor from main memory. The vector processor was capable of both accepting a single piece of data and returning a result in the same clock tick. The electronic reality is that memory, once strobed to be read, requires some time before it can be read again. Cray was always careful to make sure that the bank cool-off time was 4 clock cycles. They were also careful to arrange memory into 4 different banks and, rather than have contiguous memory in the same bank, adjacent memory locations were arranged in different memory banks. The consequence of this design is that one bank of memory will always be ready to be read.
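The bank arrangement just described can be modelled with a few lines of Java. This is a deliberately simplified toy, not a description of real Cray hardware: it assumes one read is attempted per clock tick, that address `a` lives in bank `a % 4`, and that a bank is unavailable for exactly 4 ticks after being read. The class BankModel and its method are hypothetical names.

```java
class BankModel {
    static final int BANKS = 4;      // number of interleaved banks (from the text)
    static final int COOL_OFF = 4;   // ticks before a bank can be read again (from the text)

    // Counts the ticks spent waiting on a busy bank while reading
    // `accesses` words, stepping through memory by `stride` words.
    static long stallCycles(int stride, int accesses) {
        long[] readyAt = new long[BANKS];    // first tick at which each bank is readable again
        long clock = 0;
        long stalls = 0;
        long address = 0;
        for (int i = 0; i < accesses; i++) {
            int bank = (int) (address % BANKS);
            if (readyAt[bank] > clock) {     // bank still cooling off: wait for it
                stalls += readyAt[bank] - clock;
                clock = readyAt[bank];
            }
            readyAt[bank] = clock + COOL_OFF;
            clock++;                         // the read itself takes one tick
            address += stride;
        }
        return stalls;
    }
}
```

In this model, walking consecutive addresses (`stallCycles(1, n)`) never waits, because each bank is revisited exactly when it becomes ready, while a stride of 4 lands on the same bank every time and waits 3 ticks on every read after the first.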

The developer's responsibility in this case is to ensure that any strides through memory hit the cold bank on every clock tick. To do this, you may have to adjust the data structures being used. Again, coding style counts. So by now you may be asking, what does all of this have to do with Java? The answer is: more than one would think.

Right now you may be wondering why on earth anyone would be interested in hardware level counters when they are looking at Java. After all, Java runs in an abstraction commonly known as the Java Virtual Machine, which places some distance between our code and the hardware. Aside from taking care with our choice of algorithm, what could we possibly do, short of implementing some dangerous premature optimizations, that would affect how our code utilizes the underlying hardware? Surprisingly, there are some easy changes you can make to your coding style that should help you to better utilize your hardware. More surprisingly, these style optimizations have been with us for longer than Java has.

The style optimization pointed to by Azeem was in respect to striding through a doubly indexed array.

The example presented looked something like

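public void transform(int[][] matrix) {
    // Cache-unfriendly order: the outer loop walks the right-most index (k)
    // and the inner loop walks the left-most index (j).
    // Assumes a non-empty, rectangular matrix.
    for (int k = 0; k < matrix[0].length; k++) {
        for (int j = 0; j < matrix.length; j++) {
            matrix[j][k] += 1;   // placeholder for "do stuff"
        }
    }
}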

 

According to the JLS, array dimensions are evaluated from left to right. So we can write int[][] matrix = new int[3][]; and follow that up with matrix[0] = new int[3];. This implies that it is the right-most index that points into a single-dimensional array whose elements will be held in a contiguous block of memory. So the above code "jumps" through memory, creating a situation that thrashes the CPU's onboard cache. Of course the fix is to reverse the for loops so that the code runs through memory in a more predictable manner. Now this example is a toy, so the problem is quite obvious. The question is: do you have some obfuscated code lurking in your application that is doing the same thing?
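A small Java sketch of the two traversal orders, side by side. The class TraversalOrder and the summing work are hypothetical; the point is only which index varies in the inner loop. The column-by-column version assumes a non-empty, rectangular matrix.

```java
class TraversalOrder {
    // Row by row: the inner loop walks one contiguous int[] from start to end.
    static long sumRowByRow(int[][] matrix) {
        long sum = 0;
        for (int j = 0; j < matrix.length; j++) {
            int[] row = matrix[j];
            for (int k = 0; k < row.length; k++) {
                sum += row[k];
            }
        }
        return sum;
    }

    // Column by column: every step of the inner loop moves to a different row array.
    static long sumColumnByColumn(int[][] matrix) {
        long sum = 0;
        for (int k = 0; k < matrix[0].length; k++) {
            for (int j = 0; j < matrix.length; j++) {
                sum += matrix[j][k];
            }
        }
        return sum;
    }
}
```

Both methods return the same value. Any difference shows up only in running time, and it would typically take a matrix much larger than the CPU cache before the gap becomes visible.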

Another important thing the Analyzer was able to detect was lock contention. Lock contention can have some pretty devastating effects on your application's ability to perform. Aside from starving threads of CPU time, lock contention puts pressure on the operating system. Even more interesting was watching Azeem use the Analyzer to point out how disruptive it was to the processor as well. What I got from this demo is that just as the Cray processors worked best when our code worked in a predictable manner, so too do our modern processors. And there is nothing quite as disruptive to a processor as having to execute code to acquire a lock. This isn't to say that we shouldn't acquire locks when we need to, but it does suggest that something I've known to be true in the past is still true today, even in Java: your coding style can have positive effects on performance.
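As one illustration of how coding style changes the number of lock acquisitions, here is a hedged Java sketch. It is not taken from the talk: the class LockStyles and its methods are hypothetical, and the technique shown (accumulate in a thread-local variable and take the lock once per thread) is just one common way to acquire a lock less often.

```java
class LockStyles {
    private final Object lock = new Object();
    private long total = 0;

    // Every increment acquires the shared lock, so threads contend constantly.
    long lockPerIncrement(int threads, int perThread) {
        total = 0;
        runAll(threads, () -> {
            for (int i = 0; i < perThread; i++) {
                synchronized (lock) {
                    total++;
                }
            }
        });
        return total;
    }

    // Each thread counts locally and acquires the lock once, at the end.
    long lockPerThread(int threads, int perThread) {
        total = 0;
        runAll(threads, () -> {
            long local = 0;
            for (int i = 0; i < perThread; i++) {
                local++;
            }
            synchronized (lock) {
                total += local;
            }
        });
        return total;
    }

    // Starts `threads` workers running `task` and waits for all of them to finish.
    private static void runAll(int threads, Runnable task) {
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(task);
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
    }
}
```

Both methods return `threads * perThread`. The second performs one lock acquisition per thread instead of one per increment, which is the kind of difference a profiler would be expected to surface.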

Azeem's talk at JavaOne was TS-9363, "Java Platform Performance on Multicore: Better Performance or Bigger Headache?" Related tools are Intel's VTune and AMD's Analyzer.
