The benchmark is:
      do j = 1, 100000
         forall (i=1:n:1) y(i) = y(i) + a*x(i)
      enddo
*1 General: this daxpy loop is a tiny benchmark. If the compiler generates
a different instruction, or a different instruction order, or the chip
responds slightly differently, so that each trip through the inner loop
varies by a cycle or two between versions, then you might see a 10%
difference and hypothesize a "significant difference". Most real programs
have more loops and larger loops, where a difference of a few cycles per
iteration doesn't show up as a large overall difference. I believe this
can explain much of the variation in test2 and beyond.
If DAXPY is actually all a real program does, then you might
consider using the DXML (Digital Extended Math Library) product,
which has zillions of standard routines (BLAS 1, 2, 3, etc.) that are
carefully optimized to be extremely efficient on Alpha.
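In fact, the inner loop above is exactly the BLAS 1 daxpy operation, so
the benchmark could be written as one library call per outer iteration
(a sketch; DXML supplies the standard BLAS interface):
      do j = 1, 100000
         call daxpy(n, a, x, 1, y, 1)
      enddo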
Also I might warn that at higher optimization levels the compiler could
transform your program by interchanging the 2 loops:
   do i = 1, n
      do j = 1, 100000
         y(i) = y(i) + a*x(i)
      enddo
   enddo
and it could then eliminate the inner loop:
   do i = 1, n
      y(i) = y(i) + 100000*a*x(i)
   enddo
and would produce a truly Monumental (and bogus :-) MFLOP value. I.e.,
beware of simple benchmarks.
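One common defense (just a sketch, and no guarantee against every
optimizer) is to vary the operands and use the results, so neither loop
can be collapsed or discarded:
      do j = 1, 100000
         a = a + 1.0d-9
c        varying "a" makes each pass of the J loop different, so the
c        inner loop cannot be collapsed into a single multiply
         forall (i=1:n:1) y(i) = y(i) + a*x(i)
      enddo
      print *, y(1), y(n)
c     printing the results keeps the computation from being eliminated
c     as dead code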
*2 I could not find an indication of which compiler version was
used, nor the compiler switches. I assume some level of optimization
was performed (maybe the default, -O4??), because there was 1 graph
with "No optimization". Different levels of optimization can obviously
have different effects on the performance results.
Comments on Test1:
http://rubens.its.unimelb.edu.au/~dirk/perf/test1.html
* Loop Unrolling *
I assume that the compiler unrolls the daxpy loop 4 times:
      do i = 1, (n/4)*4, 4
         y(i)   = y(i)   + a*x(i)
         y(i+1) = y(i+1) + a*x(i+1)
         y(i+2) = y(i+2) + a*x(i+2)
         y(i+3) = y(i+3) + a*x(i+3)
      end do
C and here's the cleanup loop: 0-3 iterations
      do i = i, n
         y(i) = y(i) + a*x(i)
      enddo
The unrolled loop runs faster per iteration than the second
loop. That is why when n is an exact multiple of 4, the
performance is better, since all the computations are performed
in the efficient loop.
When n is 4x+1, the cleanup loop executes only once, and the conditional
branch at the end of it falls through. This is the same direction taken
for the previous value of J, so branch prediction "guesses" correctly.
However, for n = 4x+2, the conditional branch alternates between "taken"
and "not taken", which the branch predictor doesn't handle as well (I
think it may "guess" wrong at least half the time). This is why n = 4x+1
is close to the n = 4x case, and why n = 4x+2 is lower. n = 4x+3 is the
worst because more of the work is done in the cleanup loop (and less in
the better unrolled loop). I believe this explains:
"The variation across fig 5 we have no explanation for. The low
points seem to correspond to every second odd number"
"Every second odd number" corresponds to 4x+3.
* Constant Bounds Loop Unrolling *
The compiler typically picks 4 as the unroll amount (this can also be set
using the -unroll <n> switch). However, if the number of iterations is
known at compile time, and it isn't a multiple of 4 but is a multiple of
small integers near 4, then the compiler will unroll by that amount and
won't bother generating the "cleanup" loop. So a loop with constant bound
9, 15, 18, etc. will run faster than a loop with a "variable" bound of the
same value.
Secondly, for a constant like 13, the cleanup loop is known to
execute only once, so those instructions can be scheduled among the code
following the loop, reducing its cost over the "variable" bound loop of
length 13.
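For example, with a constant bound of 9 the compiler can unroll by 3 and
emit no cleanup loop at all (a sketch of the transformed loop):
      do i = 1, 9, 3
         y(i)   = y(i)   + a*x(i)
         y(i+1) = y(i+1) + a*x(i+1)
         y(i+2) = y(i+2) + a*x(i+2)
      enddo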
* Impact of Array Length on Performance *
Now, why does the performance drop off when the array length is near 1024?
As hypothesized, it is related to the direct-mapped dcache.
The dcache is 8kbytes, broken up into 256 cache lines of 32 contiguous
bytes. A real*8 array of 1024 elements is exactly 8k bytes. (I assume)
the two arrays are declared adjacent to each other (assume x comes before
y), so they EXACTLY overlap: x(i) and y(i) compete for the identical cache
position.
Furthermore, since each cache line is 32 bytes long, which is 4
real*8 elements, x(i) overlaps with the cache line for 3 other elements of
y near y(i) (e.g., y(i-1), y(i+1), y(i+2); the exact elements depend on where
in the cache line x(i) maps). Any time the program loads x(i), the cache
line with those 4 elements of y is bumped out of the dcache (back to the
on-chip scache [secondary cache]).
The amount of overlap depends both on the array size, AND the vector
length. For example, if the array length is 1000, then x(1) overlaps with
y(25). If the vector length is 10, then there is no "vector" overlap: all
references are satisfied "quickly" from the dcache. However, if the vector
length is >= 25, then elements y(25), y(26), ... will overlap and
interfere with x(1), x(2), ... . When this vector overlap occurs, then on
each iteration of the outer J loop the overlapping cache lines have to be
read out of the scache into the dcache for x and for y: 2 cache misses for
each overlapping cache line.
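The arithmetic, assuming x starts at some base address and y is placed
immediately after it:
      addr(x(1))  = base
      addr(y(25)) = base + 8*1000 + 8*24 = base + 8192
Since the dcache index is just the address mod 8192 (8 Kbytes), x(1) and
y(25) map to the same cache line.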
This explains the "gentle" slope toward the deep trench: as there is more
vector overlap (due to an array length closer to 1024, or a longer vector
length), more time is spent on cache misses and the MFLOP rate drops.
* What about the deep trench at array length = 1024? *
Your 3-D graph shows a sharp discontinuity - is something else going on?
I think I can explain it in the following way. First, I assume the
assembler code for the unrolled loop looks like:
      load x(i)
      load x(i+1)
      load x(i+2)
      load x(i+3)
      load y(i)
      load y(i+1)
      load y(i+2)
      load y(i+3)
      4 multiplies
      4 adds
      store y(i)
      store y(i+1)
      store y(i+2)
      store y(i+3)
Second, assume that x(1) (and y(1)) do not start exactly at the beginning
of a cache line. Therefore, loading the 4 elements of x uses 2 cache lines
- the same 2 cache lines needed by the 4 elements of y.
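For example (one possible alignment - the exact offset is an assumption):
if x(i) begins 8 bytes past a line boundary, then x(i)..x(i+3) occupy
bytes 8..39 relative to that boundary, straddling two 32-byte lines; and
since the arrays sit exactly 8192 bytes apart, y(i)..y(i+3) straddle the
same two cache line positions.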
Assume we have loaded the 4 values of x (we'll get to the cache misses
for x in a moment). In order to load the 4 y's, there will be 2 cache
misses to wait for, and then the stores of y will hit in the dcache.
For the next iteration i' = i+4, what happens with the cache lines needed
by x(i+4)..x(i+7)? The cache line for x(i+4) overlaps with the cache line
for y(i+3) from the previous iteration i, so x(i+4) takes a cache miss.
Furthermore, during the previous pass of the outer J loop, the cache line
for x(i+7) was displaced by the load and store of y(i+7) - so x(i+7) will
also take a cache miss. The same reasoning applies to iteration i itself.
Therefore there are 4 cache misses for each (unrolled) iteration, i.e., 4
misses for every cache line's worth of data. This is serious cache
thrashing!
It is the 4 misses versus the 2 misses (of the more "distant" overlap) that
explains the performance trench.
The width of the trench is ~7 because there are 7 values of array length
surrounding 1024 for which the x's and y's conflict on the same iteration
i. This width depends on the unroll amount, the number of elements per
cache line, and the starting cache alignment of x (and y).
Note that Digital offers the KAP Optimizing Preprocessor product which
performs numerous program transformations in order to use the cache
efficiently. One such transform is to "pad" arrays like x and y so that
their lengths are not exact multiples of the cache size and therefore do
not conflict.
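Done by hand, the padding might look like this (a sketch; any pad that
shifts y off of x's cache lines will do):
      real*8 x(1024), pad(8), y(1024)
      common /vecs/ x, pad, y
c     the 64-byte pad offsets y by 2 cache lines from x, so x(i) and
c     y(i) no longer map to the same dcache line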
Mark Davis
c/c++ team