Unrolling AltiVec, Part 3: Down and dirty loop optimization

Learn how to tailor your code for AltiVec

developerWorks


Level: Intermediate

Peter Seebach (crankyuser@seebs.plethora.net), Author, Freelance

01 Apr 2005

This series has looked at how the AltiVec instruction set can improve performance on G4 and G5 PowerPC chips. With the theoretical discussion covered in Parts 1 and 2, Part 3 tries to actually get some code optimized. This installment of the Unrolling AltiVec series looks at some real-world code that processors might spend a serious amount of time running, and shows how to tweak it to get extra performance.
About this series

The Unrolling AltiVec series has looked at the history and usage of the SIMD components of the PowerPC architecture; these are called by various names, including Velocity Engine and VMX, but Freescale's name for it is AltiVec, which is how the previous installments refer to it. For more on the history and many names of AltiVec, see the earlier installments of this series. Links are in the Resources section below.

Motorola® AltiVec™ can dramatically improve the performance of many tasks, even tasks that you might initially think are too linear to get much advantage from parallelizing. As always, the most important part of effective optimization is determining where the system is spending most of its time, and focusing your efforts there. For instance, if you were working on an e-mail client, the most useful thing to optimize would be the code for deleting spam. Pay careful attention to the code examples in this article and you should be able to get truly incredible performance.

Since Apple makes the source code for Darwin available, it's even possible to look into optimizing the kernel. One routine comes immediately to mind: as it turns out, it takes as much as 85% of the CPU time on my dual G4 tower. This routine simply updates a counter, then continues operation. It's in the file src/xnu/osfmk/kern/sched_prim.c. Listing 1 shows the complete source:


Listing 1. Where 85% of CPU time goes


	void
	idle_thread(void)
	{
		counter(c_idle_thread_block++);
		thread_block(idle_thread_continue);
		/*NOTREACHED*/
	}

The counter() function is in fact simply a macro that expands to its argument. So, if you focus on the work done in this function, you'll see that all that's really happening is an increment of a value of type mach_counter_t, which is an unsigned int.
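That machinery can be sketched in portable C. The names counter() and mach_counter_t come from the article; the stripped-down macro definition below is an assumption for illustration, not the exact Mach source.

```c
#include <assert.h>

/* Plausible sketch of the Mach macro: in a kernel built without
 * statistics support, counter(x) reduces to its argument. */
#define counter(stmt) stmt

typedef unsigned int mach_counter_t;   /* per the article's description */

static mach_counter_t c_idle_thread_block;

/* What idle_thread() really does to the counter, minus the blocking. */
static void bump_idle_counter(void)
{
        counter(c_idle_thread_block++);
}
```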

Analyzing the code

The most important thing in optimization is understanding what the code actually does. This function can essentially be reduced to the simpler form in Listing 2, once you've removed the inefficient thread-blocking:


Listing 2. A slight simplification


	static int x;
	void
	function(void)
	{
		++x;
	}

On the PowerPC®, this expands into a fairly large chunk of assembly; you can see for yourself using gcc -S -c file.c. The output file, file.s, looks like Listing 3:


Listing 3. Assembly code for a simple function


	_function:
		mflr r0
		bcl 20,31,"L00000000001$pb"
	"L00000000001$pb":
		mflr r10
		mtlr r0
		addis r9,r10,ha16(_x-"L00000000001$pb")
		la r9,lo16(_x-"L00000000001$pb")(r9)
		lwz r2,0(r9)
		addi r2,r2,1
		stw r2,0(r9)
		blr
	.lcomm _x,4,2

Most of this code is setup. (For more on PowerPC assembly, see the Resources section.) The actual work is just done in the three lines right before the return, illustrated in Listing 4:


Listing 4. Incrementing a variable in PowerPC assembly


		lwz r2,0(r9)
		addi r2,r2,1
		stw r2,0(r9)

Load the value, add one to it, and store it back. That's fine as far as it goes -- but remember, the computer is spending something like 85% of its time calling this function over and over. The first thing to do is figure out how long the loop is really taking. I used gettimeofday() to get time stamps before and after running the loop a few times, as shown in Listing 5:


Listing 5. A minimalist test harness


	#include <stdio.h>
	#include <sys/time.h>

	int
	main(void) {
		struct timezone dontcare = { 0, 0 };
		struct timeval before, after;
		long long microsec;

		gettimeofday(&before, &dontcare);
		for (x = 0; x < 100000000; function())
			;
		gettimeofday(&after, &dontcare);
		microsec = (after.tv_usec - before.tv_usec) +
			1000000 * (after.tv_sec - before.tv_sec);
		printf("%lld microseconds\n", microsec);
		return 0;
	}

The output from a test run is 3,197,848 microseconds -- a tad over 3 million. Three seconds is a pretty long time in computing...





Unrolling the loop

The first thing to do is just unroll the loop. This is, for a simple loop like this, quite easy, as you can see in Listing 6:


Listing 6. A first pass at unrolling the loop


	void
	function(void)
	{
		++x;
		++x;
		++x;
		++x;
	}

With this simple tweak, processing time is down to 1,332,169 microseconds, a substantial improvement. The function is now better than twice as fast as it was before. It could be unrolled further, but when you're performing the same operation on a 32-bit value four times in a row, that suggests that maybe it'd be more efficient to use AltiVec. (Of course, this first unrolling would improve performance even on a non-AltiVec CPU.)





Converting to vector operations

Just as the main processor works by loading, modifying, and then storing values, AltiVec also works by loading, modifying, and then storing values. The difference is that the AltiVec unit works on 16-byte chunks of memory at a time; for instance, it can operate on four 32-bit values at once, instead of a single value at once. The load and store overhead will be a bit high, but getting the values into a vector register should produce a measurable speedup.
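The four-lanes-at-once idea can be modeled in plain C by treating a 16-byte chunk as four independent 32-bit lanes. This scalar simulation is only a model of what a single vec_add instruction does; on real AltiVec hardware all four lanes are added at once.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of a 128-bit vector add: four independent 32-bit lanes,
 * each with ordinary modular wraparound, matching vec_add's behavior
 * on vector unsigned int. */
static void model_vec_add(uint32_t dst[4],
                          const uint32_t a[4], const uint32_t b[4])
{
        for (int i = 0; i < 4; i++)
                dst[i] = a[i] + b[i];
}
```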

Consider Listing 7:


Listing 7. Naive vectorization of addition


typedef vector unsigned int v_u_int;
void
function(void)
{
         unsigned int xs[4] = { x + 1, x + 2, x + 3, x + 4 };
         register v_u_int v = vec_ld(0, xs);
         register v_u_int inc4 = (vector unsigned int) { 4, 4, 4, 4 };
         v = vec_add(v, inc4);
         vec_st(v, 0, xs);
         x = xs[3];
}

The code in Listing 7 initializes an array with the next four values after x, then loads the array into a vector register. Next, the value 4 is loaded into all four 32-bit values of a second register, and the registers are added. Finally, the incremented values are stored back into the array, and the last member of the array is read out. The time to run this version of the loop is only about a million microseconds -- not exactly the huge speedup AltiVec should offer when well-tuned, but it's nice to note that, even without a serious effort at further unrolling, performance has improved.





Unrolling further and further

The next step is to unroll a little further. The simplest thing to do is to just repeat the line adding inc4 to v. This gets time down to about 430,000 microseconds. That's pretty good. At this point, however, the dependency of each operation on previous operations is starting to show.

To get past this limit, the code needs to think ahead, and perform more operations at once. That means having more than one set of source and destination registers active at a time, and adding more than four at a time to each value, as Listing 8 shows:


Listing 8. Unrolling further


         unsigned int xs[4] = { x + 1, x + 2, x + 3, x + 4 };
         register v_u_int v1 = vec_ld(0, xs);
         register v_u_int v2, v3, v4, v5, v6, v7, v8;
         register v_u_int inc4 = (vector unsigned int) { 4, 4, 4, 4 };
         register v_u_int inc8 = (vector unsigned int) { 8, 8, 8, 8 };
         register v_u_int inc16 = (vector unsigned int) { 16, 16, 16, 16 };
         v2 = vec_add(v1, inc4);
         v3 = vec_add(v1, inc8);
         v4 = vec_add(v2, inc8);
         v5 = vec_add(v1, inc16);
         v6 = vec_add(v2, inc16);
         v7 = vec_add(v3, inc16);
         v8 = vec_add(v4, inc16);
         vec_st(v8, 0, xs);
         x = xs[3];

The intent here is to reduce the number of times that the processor needs to stall waiting for the results of a previous operation. So, for instance, before adding anything to v2, the code adds something to v1, storing it in a different register. At the end of these operations, it stores the highest value it has back into x. This brings the time for the main loop down to just over 310,000 microseconds. At this point, the vectorized version is faster than the original, unoptimized, version by a factor of ten.
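The same dependency-breaking trick pays off even in scalar code: splitting one accumulator into several independent ones lets consecutive adds overlap instead of stalling on each other. A minimal sketch, not code from the article:

```c
#include <assert.h>
#include <stddef.h>

/* Sum an array with four independent accumulators, so consecutive
 * adds do not depend on one another's results. */
static unsigned int sum4(const unsigned int *a, size_t n)
{
        unsigned int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        size_t i = 0;

        for (; i + 4 <= n; i += 4) {
                s0 += a[i];
                s1 += a[i + 1];
                s2 += a[i + 2];
                s3 += a[i + 3];
        }
        for (; i < n; i++)      /* leftover elements */
                s0 += a[i];
        return s0 + s1 + s2 + s3;
}
```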

In fact, in this particular case, it's actually marginally faster just to keep looping on adding the inc4 vector in place; the overhead of setting up all the additional registers ends up costing too much, and running eight adds in a row can trim the time to 250,000 microseconds.

However, once the set-up cost has been paid, it's possible to get a dramatic improvement by making more use of your array of vectors. The four lines filling vectors v5 through v8 can be expanded as Listing 9 shows:


Listing 9. Unrolling the unrolled code even further


         v5 = vec_add(v1, inc16);
         v6 = vec_add(v2, inc16);
         v7 = vec_add(v3, inc16);
         v8 = vec_add(v4, inc16);
         v1 = vec_add(v5, inc16);
         v2 = vec_add(v6, inc16);
         v3 = vec_add(v7, inc16);
         v4 = vec_add(v8, inc16);
         v5 = vec_add(v1, inc16);
         v6 = vec_add(v2, inc16);
         v7 = vec_add(v3, inc16);
         v8 = vec_add(v4, inc16);

This allows you to take full advantage of interleaving, making sure that there are three operations between writing a value in a given vector and using that value in another operation. This gets the execution time down to a fairly zippy 200,000 microseconds.





Analyzing performance: Color wheels and color models

Unrolling makes pretty code ugly

One of the reasons not to be too quick to optimize code is that optimized code is often harder to read than unoptimized code. From a code quality standpoint, unrolling is absolutely wrong. It creates multiple copies of the same code, which introduces more opportunities for bit rot. However, unrolled code also executes more quickly.

The same issues often apply to other comparable techniques. Converting a recursive procedure to an iterative one is often difficult and ugly, but may produce a substantial performance increase. In some cases, it may even make it possible to begin vectorizing code that was previously unvectorizable. This comes, in general, at a high cost in maintainability, and a significant risk of introducing errors. For instance, about four bugs were found in the non-vectorized version of the color transformer example you'll see later in the article, and about thirty in the vectorized version.
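As a small illustration of that recursive-to-iterative conversion (a textbook example, not code from the article): the two forms below compute the same thing, but the iterative one is a plain loop that a compiler, or a vectorizer, can actually work with.

```c
#include <assert.h>

/* Recursive form: each call depends on the one below it. */
static unsigned long fact_rec(unsigned int n)
{
        return n < 2 ? 1UL : n * fact_rec(n - 1);
}

/* Iterative form: less elegant to derive, but loop-shaped. */
static unsigned long fact_iter(unsigned int n)
{
        unsigned long r = 1;

        while (n > 1)
                r *= n--;
        return r;
}
```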

Unfortunately, while the previous example was an excellent theoretical illustration, it does not work out so well in practice. Because its code runs in the system idle loop, it ends up running the most (and thus getting the largest speedup) when the system is fairly quiet. As the system becomes more heavily loaded, and the system spends less time idle, other functions step in to sabotage the efficiency of the system and make the user aware that performance is lagging. I am talking, as any user of a modern Mac OS X system will well know, of the times that the spinning ball animation, sometimes called the "beach ball of death," appears.

This suggests an avenue for performance optimization: if the generation of that spinning color wheel could be sped up substantially, the performance enhancement would be immediately apparent. The easiest thing to do would be to convert each pixel's RGB values to HSV, increment the hue by some small amount, then convert back to RGB. The AltiVec pixel type is, unfortunately, not particularly useful for this task; what's really needed is a big array of RGB values.

The code for this is a lot more complicated, but it's still reasonably approachable. The starting function, called inc_hsv(), takes a set of input values, converts them to HSV, rotates them, and converts them back. Running this conversion on 65,536 data points takes only a fraction of a second, so the test case is to run that same loop 100 times. That takes a little under 5 seconds. Unrolling this loop will be a little more challenging. In fact, trying to write out the unrolled version would be very painful, especially because the natural size for AltiVec operations on unsigned characters is batches of 16. In the interest of brevity, I have not provided an inline listing of the non-vectorized version of this loop unrolled 16 times. To follow along with this example, see the colors.c sidefile and the color_vec.c sidefile. The original version is colors.c; the vectorized version is color_vec.c.

The sample code provided does make a few assumptions. Most noticeably, it simply assumes that all of the scratch spaces it allocates will be 16-byte aligned. Production code would want to ensure this, possibly by putting objects (such as an array of 16 bytes) in a union with some sort of vector type.
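One hedged way to guarantee that alignment in portable C11 is exactly such a union; in a real AltiVec build you would make the aligning member a vector type (say, vector unsigned char), but a _Alignas member stands in for it here.

```c
#include <assert.h>
#include <stdint.h>

/* Force 16-byte alignment of a scratch buffer by pairing it with a
 * 16-byte-aligned member.  With AltiVec enabled, the second member
 * would be "vector unsigned char" instead. */
union scratch16 {
        unsigned char bytes[16];
        _Alignas(16) unsigned char force_align[16];
};

static union scratch16 scratch;

static int scratch_is_aligned(void)
{
        return ((uintptr_t)scratch.bytes % 16) == 0;
}
```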

Unfortunately, the algorithm used in this example is a painfully bad one for AltiVec, as it involves division. AltiVec simply does not have a division operator. It has multiplication, and a reciprocal "estimate," but no division. This means that some significant chunks of the process can't be done in vector code. Still, enough of it is vectorized that, even on a first attempt, there's a noticeable improvement: a hundred times through the loop goes from just under 5 seconds, to a bit under 4. That's about a 20% speedup.
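The usual workaround mirrors what AltiVec's vec_re() estimate provides: take a rough reciprocal, sharpen it with one Newton-Raphson step, and multiply instead of dividing. This scalar sketch models the idea; the deliberately degraded estimate stands in for the hardware's roughly 12-bit-accurate one.

```c
#include <assert.h>
#include <math.h>

/* Deliberately low-precision reciprocal, standing in for vec_re(). */
static float crude_recip(float b)
{
        return (1.0f / b) * 0.999f;   /* inject ~0.1% error */
}

/* Division by multiplication: refine the estimate with one
 * Newton-Raphson step, r' = r * (2 - b*r), then multiply. */
static float div_via_reciprocal(float a, float b)
{
        float r = crude_recip(b);

        r = r * (2.0f - b * r);
        return a * r;
}
```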

Going through the code line-by-line would soon grow tedious. But a few things are worth pointing out specifically. The AltiVec code gets a huge advantage from having the vec_min() and vec_max() operators. In Listing 10, you can compare the code to select minimum and maximum values in the original program with the AltiVec version:


Listing 10. Getting the maximum of three values


		/* original version */
                 if (r > g) {
                         if (r > b) {
                                 max = r;
                                 if (g > b) {
                                         min = b;
                                 } else {
                                         min = g;
                                 }
                         } else {
                                 max = b;
                                 min = g;
                         }
                 } /* second half of this code omitted */

		/* AltiVec version */
                 max16 = vec_max(vec_max(r16, g16), b16);
                 min16 = vec_min(vec_min(r16, g16), b16);

Experienced programmers might well use a macro to calculate the minimum and maximum values, making the code substantially shorter and easier to read -- but not actually any faster. AltiVec wins dramatically on this.
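For comparison, those hypothetical macros might look like the following -- shorter than the if-ladder, but still one compare per element per step, where vec_max() and vec_min() handle 16 elements per instruction.

```c
#include <assert.h>

/* The kind of helper macros an experienced programmer might reach
 * for: correct and compact, but no faster than the if-ladder. */
#define MAX2(a, b)    ((a) > (b) ? (a) : (b))
#define MIN2(a, b)    ((a) < (b) ? (a) : (b))
#define MAX3(a, b, c) MAX2(MAX2((a), (b)), (c))
#define MIN3(a, b, c) MIN2(MIN2((a), (b)), (c))
```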

The AltiVec processor has unpack instructions, which sign-extend values. For instance, unpacking a vector signed char gives you a vector signed short. That's a nice feature; unfortunately, there are no corresponding unsigned operations. One way to deal with this is to use the vec_mule() and vec_mulo() functions, multiplying by a vector of all 1s, as Listing 11 shows:


Listing 11. Unpacking unsigned vectors


	scratch[0] = vec_mule(hue16, u8_1);
	scratch[1] = vec_mulo(hue16, u8_1);

This looks wonderful, but it introduces a bug: the 16 elements of the vector are no longer in order. The first vector holds even-numbered elements, and the second odd-numbered elements. The solution I found for this is the inorder vector, which is used to permute a 16-byte vector such that expanding it out with vec_mule() and vec_mulo() produces the first four values in one vector, the next four values in another, and so on.
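The split can be modeled in scalar C: vec_mule() consumes elements 0, 2, 4, ... and vec_mulo() elements 1, 3, 5, ..., so the widened results land deinterleaved unless the input is permuted first. This model is an illustration of the instruction semantics, not code from the sidefiles.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of vec_mule/vec_mulo widening 16 unsigned chars by
 * multiplying with 1: evens fill one output, odds fill the other. */
static void model_mule_mulo(const uint8_t in[16],
                            uint16_t even[8], uint16_t odd[8])
{
        for (int i = 0; i < 8; i++) {
                even[i] = (uint16_t)in[2 * i] * 1;
                odd[i]  = (uint16_t)in[2 * i + 1] * 1;
        }
}

/* Demonstrate the bug the article describes: results are deinterleaved. */
static int order_is_deinterleaved(void)
{
        uint8_t in[16];
        uint16_t even[8], odd[8];
        int i;

        for (i = 0; i < 16; i++)
                in[i] = (uint8_t)i;
        model_mule_mulo(in, even, odd);
        /* even[] holds 0,2,4,... and odd[] holds 1,3,5,... */
        return even[1] == 2 && odd[1] == 3 && even[7] == 14 && odd[7] == 15;
}
```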

One thing that might seem a bit confusing is trying to perform conditional operations. In general, you don't perform conditional operations on an AltiVec unit. You perform operations unconditionally, then select results from vectors based on the output of comparison functions. For instance, in the original code, wraparound of the hue value looks like Listing 12:


Listing 12. Conditional operations without AltiVec


	if (h > 251)
		h -= 251;
	else if (h < 21)
		h += 4;

In the AltiVec code (where hue has a range from 0 to 6), it's done quite differently, as you can see in Listing 13:


Listing 13. Conditional operations with AltiVec


	vector float six;
	vector float sub6;
	vector bool int dosub;

	six = (vector float) { 6.0, 6.0, 6.0, 6.0 };
	dosub = vec_cmpgt(fh[j], six);
	sub6 = vec_sub(fh[j], six);
	fh[j] = vec_sel(fh[j], sub6, dosub);

The dosub vector is filled with 1s for the fields where the vector fh[j] has a value greater than 6. A new vector, sub6, is created by subtracting 6 from every value in the original vector; then fields are selected either from the original vector or from the modified one, depending on the bits in the comparison vector.
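The compare-then-select idiom can be modeled per element in scalar C: compute both candidate results unconditionally, build an all-ones or all-zeros mask, and merge bitwise, just as vec_cmpgt and vec_sel do lane by lane. A sketch for the hue-wraparound case above:

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of vec_cmpgt + vec_sel for one lane: wrap h into
 * [0, 6) by computing both branches and merging through a mask. */
static float select_wrap(float h)
{
        float wrapped = h - 6.0f;                     /* always computed */
        uint32_t mask = (h > 6.0f) ? 0xFFFFFFFFu : 0; /* cmpgt result */
        union { float f; uint32_t u; } a, b, out;

        a.f = h;
        b.f = wrapped;
        out.u = (a.u & ~mask) | (b.u & mask);         /* the select */
        return out.f;
}
```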

With this in mind, look at the complete vectorized code listing, color_vec.c. There are two pieces of code being handled in loops. The first is the actual hue calculation, which is made more complicated by the need to divide by the delta value for each pixel. It looks like it should be vectorizable; in particular, the essential calculation made each time is identical, with only the source arrays changing.

The second is the assignment of output values, drawing from the p, q, and t vectors. This is not a very large loop, but it's a substantial chunk of code. If it could be vectorized, it would improve things quite a bit. In the first attempt, these calculations were made 16 times, without vectors. Vectorizing the generation of these numbers improved the speedup from 20% to 50%.

The updated code, as it stands, runs nearly twice as fast as the original, unvectorized version. You may be able to improve it even further. Once you've done that, go ahead and put it into the relevant piece of the Darwin kernel and see how much less time you spend waiting for the system to render that endearing beach ball.





Summary

As you can see, an experienced programmer can easily avoid the pitfalls of premature optimization, carefully selecting appropriate sections of code to optimize. The AltiVec unit does have occasional shortcomings, such as the lack of a division operator, but careful usage can nonetheless improve performance dramatically. Attention to detail is important, and making sure you understand the code you wish to optimize is absolutely necessary.





Resources





About the author

Peter Seebach uses vector processing a lot, and is personally able to cook up to three eggs at once, making him something of an expert in the field. You can
