
Chapter 5-03

2015-11-15 00:29
Please indicate the source: http://blog.csdn.net/gaoxiangnumber1

5.8 Loop Unrolling

Loop unrolling is a program transformation that reduces the number of iterations for a loop by increasing the number of elements computed on each iteration.

Function psum2 (Figure 5.1) illustrates this: each iteration computes two elements of the prefix sum, thereby halving the total number of iterations required.
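For reference, here is a sketch of such a two-element-per-iteration prefix sum in the spirit of psum2. It is a reconstruction against plain float arrays, not a verbatim copy of Figure 5.1:

```c
/* Prefix sum, two elements per iteration (psum2-style sketch).
 * p[i] = a[0] + a[1] + ... + a[i]. Assumes n >= 1. */
void psum2(float a[], float p[], long n)
{
    long i;
    p[0] = a[0];
    for (i = 1; i < n - 1; i += 2) {
        float mid_val = p[i - 1] + a[i];   /* element i */
        p[i] = mid_val;
        p[i + 1] = mid_val + a[i + 1];     /* element i+1 reuses mid_val */
    }
    /* For even n, one element remains; finish it here */
    if (i < n)
        p[i] = p[i - 1] + a[i];
}
```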

Loop unrolling can improve performance in two ways.

First, it reduces the number of operations that do not contribute directly to the program result, such as loop indexing and conditional branching.

Second, it exposes ways in which we can further transform the code to reduce the number of operations in the critical paths of the overall computation.

We can unroll a loop by any factor k. To do so, we set the upper limit to be n − k + 1, and within the loop apply the combining operation to elements i through i + k − 1. Loop index i is incremented by k in each iteration. The maximum array index i + k − 1 will then be less than n. We include the second loop to step through the final few elements of the vector one at a time. The body of this loop will be executed between 0 and k − 1 times.
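As a concrete illustration, here is a sketch of two-way (k = 2) unrolling in the style of the book's combine5, written against a plain array of double rather than the book's vec_ptr abstraction; the names data, n, and dest are placeholders for this sketch:

```c
/* 2 x 1 loop unrolling (combine5-style sketch): two elements per
 * iteration, still accumulated into a single variable acc. */
void combine5(double data[], long n, double *dest)
{
    long i;
    long limit = n - 1;     /* upper limit n - k + 1 with k = 2 */
    double acc = 1.0;       /* identity element for multiplication */

    /* Combine two elements per iteration */
    for (i = 0; i < limit; i += 2)
        acc = (acc * data[i]) * data[i + 1];

    /* Finish any remaining elements one at a time */
    for (; i < n; i++)
        acc = acc * data[i];

    *dest = acc;
}
```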

Several phenomena contribute to the measured CPE values. The improvement for integer addition can be attributed to the benefits of reducing loop overhead operations: by reducing the number of overhead operations relative to the number of additions required to compute the vector sum, we reach the point where the one-cycle latency of integer addition becomes the performance-limiting factor.

The mulss instructions each get translated into two operations: one to load an array element from memory, and one to multiply this value by the accumulated value.

Register %xmm0 gets read and written twice in each execution of the loop. We can simplify and abstract this graph to obtain the template shown in Figure 5.19(b). We then replicate this template n/2 times to show the computation for a vector of length n, obtaining the data-flow representation shown in Figure 5.20.

There is still a critical path of n mul operations in this graph—there are half as many iterations, but each iteration has two multiplication operations in sequence. Since the critical path was the limiting factor for the performance of the code without loop unrolling, it remains so with simple loop unrolling.

5.9 Enhancing Parallelism

The functional units performing addition and multiplication are all fully pipelined, meaning that they can start new operations every clock cycle. Our code cannot take advantage of this capability because we are accumulating the value as a single variable acc. We cannot compute a new value for acc until the preceding computation has completed. Even though the functional unit can start a new operation every clock cycle, it will only start one every L cycles, where L is the latency of the combining operation.
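To make the limitation concrete, this is the single-accumulator pattern underlying combine4 and combine5 (a simplified sketch with placeholder names, showing the multiplication case):

```c
/* Single-accumulator combining loop (combine4-style sketch).
 * Each multiplication reads the value of acc produced by the previous
 * iteration, so even a fully pipelined multiplier can start a new
 * multiply only once every L cycles, where L is its latency. */
void combine4(double data[], long n, double *dest)
{
    double acc = 1.0;
    for (long i = 0; i < n; i++)
        acc = acc * data[i];   /* serial dependency through acc */
    *dest = acc;
}
```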

5.9.1 Multiple Accumulators

For a combining operation that is associative and commutative, such as integer addition or multiplication, we can improve performance by splitting the set of combining operations into two or more parts and combining the results at the end.

Let Pn denote the product of elements a0, a1, …, an−1. Assuming n is even, we can also write this as Pn = PEn × POn, where PEn = a0 × a2 × … × an−2 is the product of the elements with even indices, and POn = a1 × a3 × … × an−1 is the product of the elements with odd indices.

Figure 5.21 shows code that uses both two-way loop unrolling and two-way parallelism, accumulating elements with even indices in variable acc0 and elements with odd indices in variable acc1. The combining operation is then applied to acc0 and acc1 to compute the final result.
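A sketch of this 2 × 2 transformation (two-way unrolling with two accumulators), again written against a plain double array rather than the book's vec_ptr interface:

```c
/* 2 x 2 loop unrolling (combine6-style sketch): two elements per
 * iteration, accumulated in two independent variables. */
void combine6(double data[], long n, double *dest)
{
    long i;
    long limit = n - 1;
    double acc0 = 1.0;   /* product of even-indexed elements */
    double acc1 = 1.0;   /* product of odd-indexed elements  */

    /* Combine two elements per iteration, into separate accumulators */
    for (i = 0; i < limit; i += 2) {
        acc0 = acc0 * data[i];
        acc1 = acc1 * data[i + 1];
    }

    /* Finish any remaining elements */
    for (; i < n; i++)
        acc0 = acc0 * data[i];

    /* Combine the two partial products */
    *dest = acc0 * acc1;
}
```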

Figure 5.22 demonstrates the effect of applying this transformation to achieve k-way loop unrolling and k-way parallelism for values up to k = 6. For integer multiplication, and for the floating-point operations, we see a CPE value of L/k, where L is the latency of the operation, up to the throughput bound of 1.00.

As with combine5, the inner loop contains two mulss operations, but these instructions translate into mul operations that read and write separate registers, with no data dependency between them (Figure 5.24(b)). We then replicate this template n/2 times (Figure 5.25), modeling the execution of the function on a vector of length n.

We see that we now have two critical paths, one corresponding to computing the product of the even-numbered elements (program value acc0) and one for the odd-numbered elements (program value acc1). Each of these critical paths contains only n/2 operations, thus leading to a CPE of 4.00/2 = 2.00. We are exploiting the pipelining capability of the functional unit to increase its utilization by a factor of 2. When we apply this transformation for larger values of k, we find that we cannot reduce the CPE below 1.00. Once we reach this point, several of the functional units are operating at maximum capacity.

Two’s-complement arithmetic is commutative and associative, even when overflow occurs. So for an integer data type, the result computed by combine6 will be identical to that computed by combine5 under all possible conditions. Thus, an optimizing compiler could potentially convert the code shown in combine4 first to a two-way unrolled variant of combine5 by loop unrolling, and then to that of combine6 by introducing parallelism. Many compilers do loop unrolling automatically, but relatively few then introduce this form of parallelism.

Floating-point multiplication and addition, on the other hand, are not associative. Thus, combine5 and combine6 could produce different results due to rounding or overflow. Consider, for example, a product computation in which all of the elements with even indices are numbers with very large absolute value, while those with odd indices are very close to 0.0. The product PEn might overflow, or POn might underflow, even though computing the product Pn sequentially proceeds normally. For most applications, achieving a performance gain of 2× outweighs the risk of generating different results for strange data patterns. A program developer should nonetheless check with potential users to see whether there are particular conditions under which the revised algorithm would be unacceptable.
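A small demonstration of this risk (the values below are made up purely to trigger overflow and underflow in float; this is an illustration, not code from the book):

```c
#include <stdio.h>

/* The same four floats multiplied in sequential order versus the
 * even/odd split order. The even-index partial product overflows to
 * infinity and the odd-index one underflows to 0, so the split version
 * yields inf * 0 = NaN instead of (approximately) 1.0. */
int main(void)
{
    float a[4] = {1e30f, 1e-30f, 1e30f, 1e-30f};

    float seq = ((a[0] * a[1]) * a[2]) * a[3];   /* combine5-style order */
    float even = a[0] * a[2];                    /* overflows to inf */
    float odd  = a[1] * a[3];                    /* underflows to 0  */
    float split = even * odd;                    /* inf * 0 = NaN    */

    printf("sequential = %g\n", seq);    /* prints approximately 1 */
    printf("split      = %g\n", split);  /* prints nan (or -nan)   */
    return 0;
}
```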

Please indicate the source: http://blog.csdn.net/gaoxiangnumber1.