Unrolling AltiVec, Part 3: Down and dirty loop optimization

Learn how to tailor your code for AltiVec

developerWorks


Level: Intermediate

Peter Seebach (crankyuser@seebs.plethora.net), Author, Freelance

01 Apr 2005

This series has looked at how the AltiVec instruction set can improve performance on G4 and G5 PowerPC chips. With the theoretical discussion covered in Parts 1 and 2, Part 3 tries to actually get some code optimized. This installment of the Unrolling AltiVec series looks at some real-world code that processors might spend a serious amount of time running, and shows how to tweak it to get extra performance.
About this series

The Unrolling AltiVec series has looked at the history and usage of the SIMD components of the PowerPC architecture; these are called by various names, including Velocity Engine and VMX, but Freescale's name for it is AltiVec, which is how the previous installments refer to it. For more on the history and many names of AltiVec, see the earlier installments of this series. Links are in the Resources section below.

Motorola® AltiVec™ can dramatically improve the performance of many tasks, even tasks that you might initially think are too linear to get much advantage from parallelizing. As always, the most important part of effective optimization is determining where the system is spending most of its time, and focusing your efforts there. For instance, if you were working on an e-mail client, the most useful thing to optimize would be the code for deleting spam. Pay careful attention to the code examples in this article and you should be able to get truly incredible performance.

Since Apple makes the source code for Darwin available, it's even possible to look into optimizing the kernel. One routine comes immediately to mind: as it turns out, it takes as much as 85% of the CPU time on my dual G4 tower. This routine simply updates a counter, then continues operation. It's in the file src/xnu/osfmk/kern/sched_prim.c. Listing 1 shows the complete source:


Listing 1. Where 85% of CPU time goes


	void
	idle_thread(void)
	{
		counter(c_idle_thread_block++);
		thread_block(idle_thread_continue);
		/*NOTREACHED*/
	}

The counter() function is in fact simply a macro that expands to its argument. So, if you focus on the work done in this function, you'll see that all that's really happening is an increment of a value of type mach_counter_t, which is an unsigned int.
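That machinery can be sketched in portable C. The names counter() and mach_counter_t come from the article; the stripped-down macro definition below is an assumption for illustration, not the exact Mach source.

```c
#include <assert.h>

/* Plausible sketch of the Mach macro: in a kernel built without
 * statistics support, counter(x) reduces to its argument. */
#define counter(stmt) stmt

typedef unsigned int mach_counter_t;   /* per the article's description */

static mach_counter_t c_idle_thread_block;

/* What idle_thread() really does to the counter, minus the blocking. */
static void bump_idle_counter(void)
{
        counter(c_idle_thread_block++);
}
```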

Analyzing the code

The most important thing in optimization is understanding what the code actually does. This function can essentially be reduced to the simpler form in Listing 2, once you've removed the inefficient thread-blocking:


Listing 2. A slight simplification


	static int x;
	void
	function(void)
	{
		++x;
	}

On the PowerPC®, this expands into a fairly large chunk of assembly; you can see for yourself using gcc -S -c file.c. The output file, file.s, looks like Listing 3:


Listing 3. Assembly code for a simple function


	_function:
		mflr r0
		bcl 20,31,"L00000000001$pb"
	"L00000000001$pb":
		mflr r10
		mtlr r0
		addis r9,r10,ha16(_x-"L00000000001$pb")
		la r9,lo16(_x-"L00000000001$pb")(r9)
		lwz r2,0(r9)
		addi r2,r2,1
		stw r2,0(r9)
		blr
	.lcomm _x,4,2

Most of this code is setup. (For more on PowerPC assembly, see the Resources section.) The actual work is just done in the three lines right before the return, illustrated in Listing 4:


Listing 4. Incrementing a variable in PowerPC assembly


		lwz r2,0(r9)
		addi r2,r2,1
		stw r2,0(r9)

Load the value, add one to it, and store it back. That's fine as far as it goes -- but remember, the computer is spending something like 85% of its time calling this function over and over. The first thing to do is figure out how long the loop is really taking. I used gettimeofday() to get time stamps before and after running the loop a few times, as shown in Listing 5:


Listing 5. A minimalist test harness


	#include <stdio.h>
	#include <sys/time.h>

	int
	main(void) {
		struct timezone dontcare = { 0, 0 };
		struct timeval before, after;
		long long microsec;

		gettimeofday(&before, &dontcare);
		for (x = 0; x < 100000000; function())
			;
		gettimeofday(&after, &dontcare);
		microsec = (after.tv_usec - before.tv_usec) +
			1000000 * (after.tv_sec - before.tv_sec);
		printf("%lld microseconds\n", microsec);
		return 0;
	}

The output from a test run is 3,197,848 microseconds -- a tad over 3 million. Three seconds is a pretty long time in computing...





Unrolling the loop

The first thing to do is just unroll the loop. This is, for a simple loop like this, quite easy, as you can see in Listing 6:


Listing 6. A first pass at unrolling the loop


	void
	function(void)
	{
		++x;
		++x;
		++x;
		++x;
	}

With this simple tweak, processing time is down to 1,332,169 microseconds, a substantial improvement. The function is now better than twice as fast as it was before. It could be unrolled further, but when you're performing the same operation on a 32-bit value four times in a row, that suggests that maybe it'd be more efficient to use AltiVec. (Of course, this first unrolling would improve performance even on a non-AltiVec CPU.)





Converting to vector operations

Just as the main processor works by loading, modifying, and then storing values, AltiVec also works by loading, modifying, and then storing values. The difference is that the AltiVec unit works on 16-byte chunks of memory at a time; for instance, it can operate on four 32-bit values at once, instead of a single value at once. The load and store overhead will be a bit high, but getting the values into a vector register should produce a measurable speedup.
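The four-lanes-at-once idea can be modeled in plain C by treating a 16-byte chunk as four independent 32-bit lanes. This scalar simulation is only a model of what a single vec_add instruction does; on real AltiVec hardware all four lanes are added at once.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of a 128-bit vector add: four independent 32-bit lanes,
 * each with ordinary modular wraparound, matching vec_add's behavior
 * on vector unsigned int. */
static void model_vec_add(uint32_t dst[4],
                          const uint32_t a[4], const uint32_t b[4])
{
        for (int i = 0; i < 4; i++)
                dst[i] = a[i] + b[i];
}
```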

Consider Listing 7:


Listing 7. Naive vectorization of addition


typedef vector unsigned int v_u_int;
void
function(void)
{
         unsigned int xs[4] = { x + 1, x + 2, x + 3, x + 4 };
         register v_u_int v = vec_ld(0, xs);
         register v_u_int inc4 = (vector unsigned int) { 4, 4, 4, 4 };
         v = vec_add(v, inc4);
         vec_st(v, 0, xs);
         x = xs[3];
}

The code in Listing 7 initializes an array with the next four values after x, then loads the array into a vector register. Next, the value 4 is loaded into all four 32-bit values of a second register, and the registers are added. Finally, the incremented values are stored back into the array, and the last member of the array is read out. The time to run this version of the loop is only about a million microseconds -- not exactly the huge speedup AltiVec should offer when well-tuned, but it's nice to note that, even without a serious effort at further unrolling, performance has improved.





Unrolling further and further

The next step is to unroll a little further. The simplest thing to do is to just repeat the line adding inc4 to v. This gets time down to about 430,000 microseconds. That's pretty good. At this point, however, the dependency of each operation on previous operations is starting to show.

To get past this limit, the code needs to think ahead, and perform more operations at once. That means having more than one set of source and destination registers active at a time, and adding more than four at a time to each value, as Listing 8 shows:


Listing 8. Unrolling further


         unsigned int xs[4] = { x + 1, x + 2, x + 3, x + 4 };
         register v_u_int v1 = vec_ld(0, xs);
         register v_u_int v2, v3, v4, v5, v6, v7, v8;
         register v_u_int inc4 = (vector unsigned int) { 4, 4, 4, 4 };
         register v_u_int inc8 = (vector unsigned int) { 8, 8, 8, 8 };
         register v_u_int inc16 = (vector unsigned int) { 16, 16, 16, 16 };
         v2 = vec_add(v1, inc4);
         v3 = vec_add(v1, inc8);
         v4 = vec_add(v2, inc8);
         v5 = vec_add(v1, inc16);
         v6 = vec_add(v2, inc16);
         v7 = vec_add(v3, inc16);
         v8 = vec_add(v4, inc16);
         vec_st(v8, 0, xs);
         x = xs[3];

The intent here is to reduce the number of times that the processor needs to stall waiting for the results of a previous operation. So, for instance, before adding anything to v2, the code adds something to v1, storing it in a different register. At the end of these operations, it stores the highest value it has back into x. This brings the time for the main loop down to just over 310,000 microseconds. At this point, the vectorized version is faster than the original, unoptimized, version by a factor of ten.
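The same dependency-breaking trick pays off even in scalar code: splitting one accumulator into several independent ones lets consecutive adds overlap instead of stalling on each other. A minimal sketch, not code from the article:

```c
#include <assert.h>
#include <stddef.h>

/* Sum an array with four independent accumulators, so consecutive
 * adds do not depend on one another's results. */
static unsigned int sum4(const unsigned int *a, size_t n)
{
        unsigned int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        size_t i = 0;

        for (; i + 4 <= n; i += 4) {
                s0 += a[i];
                s1 += a[i + 1];
                s2 += a[i + 2];
                s3 += a[i + 3];
        }
        for (; i < n; i++)      /* leftover elements */
                s0 += a[i];
        return s0 + s1 + s2 + s3;
}
```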

In fact, in this particular case, it's actually marginally faster just to keep looping on adding the inc4 vector in place; the overhead of setting up all the additional registers ends up costing too much, and running eight adds in a row can trim the time to 250,000 microseconds.

However, once the set-up cost has been paid, it's possible to get a dramatic improvement by making more use of your array of vectors. The four lines filling vectors v5 through v8 can be expanded as Listing 9 shows:


Listing 9. Unrolling the unrolled code even further


         v5 = vec_add(v1, inc16);
         v6 = vec_add(v2, inc16);
         v7 = vec_add(v3, inc16);
         v8 = vec_add(v4, inc16);
         v1 = vec_add(v5, inc16);
         v2 = vec_add(v6, inc16);
         v3 = vec_add(v7, inc16);
         v4 = vec_add(v8, inc16);
         v5 = vec_add(v1, inc16);
         v6 = vec_add(v2, inc16);
         v7 = vec_add(v3, inc16);
         v8 = vec_add(v4, inc16);

This allows you to take full advantage of interleaving, making sure that there are three operations between writing a value in a given vector and using that value in another operation. This gets the execution time down to a fairly zippy 200,000 microseconds.





Analyzing performance: Color wheels and color models

Unrolling makes pretty code ugly

One of the reasons not to be too quick to optimize code is that optimized code is often harder to read than unoptimized code. From a code quality standpoint, unrolling is absolutely wrong. It creates multiple copies of the same code, which introduces more opportunities for bit rot. However, unrolled code also executes more quickly.

The same issues often apply to other comparable techniques. Converting a recursive procedure to an iterative one is often difficult and ugly, but may produce a substantial performance increase. In some cases, it may even make it possible to begin vectorizing code that was previously unvectorizable. This comes, in general, at a high cost in maintainability, and a significant risk of introducing errors. For instance, about four bugs were found in the non-vectorized version of the color transformer example you'll see later in the article, and about thirty in the vectorized version.
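As a small illustration of that recursive-to-iterative conversion (a textbook example, not code from the article): the two forms below compute the same thing, but the iterative one is a plain loop that a compiler, or a vectorizer, can actually work with.

```c
#include <assert.h>

/* Recursive form: each call depends on the one below it. */
static unsigned long fact_rec(unsigned int n)
{
        return n < 2 ? 1UL : n * fact_rec(n - 1);
}

/* Iterative form: less elegant to derive, but loop-shaped. */
static unsigned long fact_iter(unsigned int n)
{
        unsigned long r = 1;

        while (n > 1)
                r *= n--;
        return r;
}
```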

Unfortunately, while the previous example was an excellent theoretical illustration, it does not work out so well in practice. Because its code runs in the system idle loop, it ends up running the most (and thus getting the largest speedup) when the system is fairly quiet. As the system becomes more heavily loaded, and the system spends less time idle, other functions step in to sabotage the efficiency of the system and make the user aware that performance is lagging. I am talking, as any user of a modern Mac OS X system will well know, of the times that the spinning ball animation, sometimes called the "beach ball of death," appears.

This suggests an avenue for performance optimization: if the generation of that spinning color wheel could be sped up substantially, the performance enhancement would be immediately apparent. The easiest thing to do would be to convert each pixel's RGB values to HSV, increment the hue by some small amount, then convert back to RGB. The AltiVec pixel type is, unfortunately, not particularly useful for this task; what's really needed is a big array of RGB values.

The code for this is a lot more complicated, but it's still reasonably approachable. The starting function, called inc_hsv(), takes a set of input values, converts them to HSV, rotates them, and converts them back. Running this conversion on 65,536 data points takes only a fraction of a second, so the test case is to run that same loop 100 times. That takes a little under 5 seconds. Unrolling this loop will be a little more challenging. In fact, trying to write out the unrolled version would be very painful, especially because the natural size for AltiVec operations on unsigned characters is batches of 16. In the interest of brevity, I have not provided an inline listing of the non-vectorized version of this loop unrolled 16 times. To follow along with this example, see the colors.c sidefile and the color_vec.c sidefile. The original version is colors.c; the vectorized version is color_vec.c.

The sample code provided does make a few assumptions. Most noticeably, it simply assumes that all of the scratch spaces it allocates will be 16-byte aligned. Production code would want to ensure this, possibly by putting objects (such as an array of 16 bytes) in a union with some sort of vector type.
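One hedged way to guarantee that alignment in portable C11 is exactly such a union; in a real AltiVec build you would make the aligning member a vector type (say, vector unsigned char), but a _Alignas member stands in for it here.

```c
#include <assert.h>
#include <stdint.h>

/* Force 16-byte alignment of a scratch buffer by pairing it with a
 * 16-byte-aligned member.  With AltiVec enabled, the second member
 * would be "vector unsigned char" instead. */
union scratch16 {
        unsigned char bytes[16];
        _Alignas(16) unsigned char force_align[16];
};

static union scratch16 scratch;

static int scratch_is_aligned(void)
{
        return ((uintptr_t)scratch.bytes % 16) == 0;
}
```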

Unfortunately, the algorithm used in this example is a painfully bad one for AltiVec, as it involves division. AltiVec simply does not have a division operator. It has multiplication, and a reciprocal "estimate," but no division. This means that some significant chunks of the process can't be done in vector code. Still, enough of it is vectorized that, even on a first attempt, there's a noticeable improvement: a hundred times through the loop goes from just under 5 seconds, to a bit under 4. That's about a 20% speedup.
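The usual workaround mirrors what AltiVec's vec_re() estimate provides: take a rough reciprocal, sharpen it with one Newton-Raphson step, and multiply instead of dividing. This scalar sketch models the idea; the deliberately degraded estimate stands in for the hardware's roughly 12-bit-accurate one.

```c
#include <assert.h>
#include <math.h>

/* Deliberately low-precision reciprocal, standing in for vec_re(). */
static float crude_recip(float b)
{
        return (1.0f / b) * 0.999f;   /* inject ~0.1% error */
}

/* Division by multiplication: refine the estimate with one
 * Newton-Raphson step, r' = r * (2 - b*r), then multiply. */
static float div_via_reciprocal(float a, float b)
{
        float r = crude_recip(b);

        r = r * (2.0f - b * r);
        return a * r;
}
```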

Going through the code line-by-line would soon grow tedious. But a few things are worth pointing out specifically. The AltiVec code gets a huge advantage from having the vec_min() and vec_max() operators. In Listing 10, you can compare the code to select minimum and maximum values in the original program with the AltiVec version:


Listing 10. Getting the maximum of three values


		/* original version */
                 if (r > g) {
                         if (r > b) {
                                 max = r;
                                 if (g > b) {
                                         min = b;
                                 } else {
                                         min = g;
                                 }
                         } else {
                                 max = b;
                                 min = g;
                         }
                 } /* second half of this code omitted */

		/* AltiVec version */
                 max16 = vec_max(vec_max(r16, g16), b16);
                 min16 = vec_min(vec_min(r16, g16), b16);

Experienced programmers might well use a macro to calculate the minimum and maximum values, making the code substantially shorter and easier to read -- but not actually any faster. AltiVec wins dramatically on this.
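For comparison, those hypothetical macros might look like the following -- shorter than the if-ladder, but still one compare per element per step, where vec_max() and vec_min() handle 16 elements per instruction.

```c
#include <assert.h>

/* The kind of helper macros an experienced programmer might reach
 * for: correct and compact, but no faster than the if-ladder. */
#define MAX2(a, b)    ((a) > (b) ? (a) : (b))
#define MIN2(a, b)    ((a) < (b) ? (a) : (b))
#define MAX3(a, b, c) MAX2(MAX2((a), (b)), (c))
#define MIN3(a, b, c) MIN2(MIN2((a), (b)), (c))
```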

The AltiVec processor has unpack instructions, which sign-extend values. For instance, unpacking a vector signed char gives you a vector signed short. That's a nice feature; unfortunately, there are no corresponding unsigned operations. One way to deal with this is to use the vec_mule() and vec_mulo() functions, multiplying by a vector of all 1s, as Listing 11 shows:


Listing 11. Unpacking unsigned vectors


	scratch[0] = vec_mule(hue16, u8_1);
	scratch[1] = vec_mulo(hue16, u8_1);

This looks wonderful, but it introduces a bug: the 16 elements of the vector are no longer in order. The first vector holds even-numbered elements, and the second odd-numbered elements. The solution I found for this is the inorder vector, which is used to permute a 16-byte vector such that expanding it out with vec_mule() and vec_mulo() produces the first four values in one vector, the next four values in another, and so on.
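The split can be modeled in scalar C: vec_mule() consumes elements 0, 2, 4, ... and vec_mulo() elements 1, 3, 5, ..., so the widened results land deinterleaved unless the input is permuted first. This model is an illustration of the instruction semantics, not code from the sidefiles.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of vec_mule/vec_mulo widening 16 unsigned chars by
 * multiplying with 1: evens fill one output, odds fill the other. */
static void model_mule_mulo(const uint8_t in[16],
                            uint16_t even[8], uint16_t odd[8])
{
        for (int i = 0; i < 8; i++) {
                even[i] = (uint16_t)in[2 * i] * 1;
                odd[i]  = (uint16_t)in[2 * i + 1] * 1;
        }
}

/* Demonstrate the bug the article describes: results are deinterleaved. */
static int order_is_deinterleaved(void)
{
        uint8_t in[16];
        uint16_t even[8], odd[8];
        int i;

        for (i = 0; i < 16; i++)
                in[i] = (uint8_t)i;
        model_mule_mulo(in, even, odd);
        /* even[] holds 0,2,4,... and odd[] holds 1,3,5,... */
        return even[1] == 2 && odd[1] == 3 && even[7] == 14 && odd[7] == 15;
}
```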

One thing that might seem a bit confusing is trying to perform conditional operations. In general, you don't perform conditional operations on an AltiVec unit. You perform operations unconditionally, then select results from vectors based on the output of comparison functions. For instance, in the original code, wraparound of the hue value looks like Listing 12:


Listing 12. Conditional operations without AltiVec


	if (h > 251)
		h -= 251;
	else if (h < 21)
		h += 4;

In the AltiVec code (where hue has a range from 0 to 6), it's done quite differently, as you can see in Listing 13:


Listing 13. Conditional operations with AltiVec


	vector float six;
	vector float sub6;
	vector bool int dosub;

	six = (vector float) { 6.0, 6.0, 6.0, 6.0 };
	dosub = vec_cmpgt(fh[j], six);
	sub6 = vec_sub(fh[j], six);
	fh[j] = vec_sel(fh[j], sub6, dosub);

The dosub vector is filled with 1s for the fields where the vector fh[j] has a value greater than 6. A new vector, sub6, is created by subtracting 6 from every value in the original vector; then fields are selected either from the original vector or from the modified one, depending on the bits in the comparison vector.
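The compare-then-select idiom can be modeled per element in scalar C: compute both candidate results unconditionally, build an all-ones or all-zeros mask, and merge bitwise, just as vec_cmpgt and vec_sel do lane by lane. A sketch for the hue-wraparound case above:

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of vec_cmpgt + vec_sel for one lane: wrap h into
 * [0, 6) by computing both branches and merging through a mask. */
static float select_wrap(float h)
{
        float wrapped = h - 6.0f;                     /* always computed */
        uint32_t mask = (h > 6.0f) ? 0xFFFFFFFFu : 0; /* cmpgt result */
        union { float f; uint32_t u; } a, b, out;

        a.f = h;
        b.f = wrapped;
        out.u = (a.u & ~mask) | (b.u & mask);         /* the select */
        return out.f;
}
```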

With this in mind, look at the complete vectorized code listing, color_vec.c. There are two pieces of code being handled in loops. The first is the actual hue calculation, which is made more complicated by the need to divide by the delta value for each pixel. It looks like it should be vectorizable; in particular, the essential calculation made each time is identical, with only the source arrays changing.

The second is the assignment of output values, drawing from the p, q, and t vectors. This is not a very large loop, but it's a substantial chunk of code. If it could be vectorized, it would improve things quite a bit. In the first attempt, these calculations were made 16 times, without vectors. Vectorizing the generation of these numbers improved the speedup from 20% to 50%.

The updated code, as it stands, runs nearly twice as fast as the original, unvectorized version. You may be able to improve it even further. Once you've done that, go ahead and put it into the relevant piece of the Darwin kernel and see how much less time you spend waiting for the system to render that endearing beach ball.





Summary

As you can see, an experienced programmer can easily avoid the pitfalls of premature optimization, carefully selecting appropriate sections of code to optimize. The AltiVec unit does have occasional shortcomings, such as the lack of a division operator, but careful usage can nonetheless improve performance dramatically. Attention to detail is important, and making sure you understand the code you wish to optimize is absolutely necessary.





Resources





About the author

Peter Seebach uses vector processing a lot, and is personally able to cook up to three eggs at once, making him something of an expert in the field. You can
