
Conference turris::fortran

Title: Digital Fortran
Notice: Read notes 1.* for important information
Moderator: QUARK::LIONEL
Created: Thu Jun 01 1995
Last Modified: Fri Jun 06 1997
Last Successful Update: Fri Jun 06 1997
Number of topics: 1333
Total number of notes: 6734

1256.0. "Help with performance variations F90" by HGOVC::JOELBERMAN () Mon Apr 14 1997 02:29

    Crossposted to wonder::turbolaser
    
    Have a customer who is looking for an explanation for large variations
    in performance when varying step sizes, loop placement, etc. using F90
    on an 8-CPU (350 MHz), 8 GB 8400.

    Could I get some help in explaining these performance trenches?
    The report is at
    http://rubens.its.unimelb.edu.au/~dirk/perf/perf.html
    
1256.1. by TLE::EKLUND (Always smiling on the inside!) Fri May 02 1997 11:04
    	Obviously you are not getting much response to this note.  I
    think that while the customer labelled the 8400 5/300 a "Seriously
    Weird Machine", a more appropriate name might be a "Seriously
    Fast CPU".
    
    	In a nutshell, the customer is asking why the performance is
    so "irregular" with changes in loop nesting, loop sizes, etc.
    I think the simple answer is that the processor is disproportionately
    fast compared to other system components (like memory), and hence any
    memory/cache problems become very visible.  If the processor were
    slower, it would be harder to notice these "divots".
    
    	This has two effects: it makes it an interesting exercise to find
    the "sweet spot" for performance (just ask our SPEC experts about their
    experiences...), and once you find it, things run incredibly fast.
    
    	I know that this is not much of an answer to the report you cited,
    but there isn't likely to be any other answer (unless someone
    wants to describe the wonders of page coloring!?).  We have been aware
    that memory is becoming a more and more serious issue for processors
    of this generation/speed, and have been focusing much energy on how to
    provide better memory/cache/software systems so that the overall
    balance between the processors and memory is better.  Most of the
    report came as no surprise at all - this is really the nature of a
    system with a Seriously Fast CPU...
    
    Cheers!
    Dave Eklund
    
1256.2. "Some explanations for performance Variations" by DECC::MDAVIS (Mark Davis - compiler maniac) Mon Jun 02 1997 17:03
The benchmark is:

        do j = 1,100000
           forall(i=1:n:1) y(i) = y(i) + a*x(i)
        enddo
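
A minimal self-contained driver along these lines (the array length, the
vector length, the value of a, and the timing calls are all my assumptions -
the report's actual harness isn't shown) would look something like:

        program daxpy_bench
        implicit none
        integer, parameter :: alen = 1024     ! array length  - an assumption
        integer, parameter :: n    = 1000     ! vector length - an assumption
        real*8 x(alen), y(alen), a
        integer i, j, c0, c1, rate
        a = 3.0d0
        x = 1.0d0
        y = 0.0d0
        call system_clock(c0, rate)
        do j = 1,100000
           forall(i=1:n:1) y(i) = y(i) + a*x(i)
        enddo
        call system_clock(c1)
        ! 2 flops per element per trip; print y(1) so the result is "used"
        print *, 'MFLOPS =', 2.0d0*n*100000/(dble(c1-c0)/rate)/1.0d6, y(1)
        end program daxpy_bench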

*1 General: this daxpy loop is a tiny benchmark.  If the compiler generates
a different instruction, or a different order, or the chip responds
slightly differently, so that each trip through the inner loop of one
version varies by a cycle or two from another version, then you might see
a 10% difference and hypothesize a "significant difference".  Most real
programs have more loops and larger loops, where a difference of a few
cycles per iteration doesn't show up as a large overall difference.  I
believe this can explain much of the variation in test2 and beyond.

If DAXPY is actually all a real program does, then you might 
consider using the DXML (Digital Extended Math Library) product,
which has zillions of standard routines (BLAS 1, 2, 3, etc.) that are
carefully optimized to be extremely efficient on Alpha.
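
For example, the inner statement could become a call to the standard BLAS
level-1 routine DAXPY (declarations as in the benchmark above; linking
against DXML is assumed):

        ! daxpy(n, da, dx, incx, dy, incy) computes dy = dy + da*dx
        do j = 1,100000
           call daxpy(n, a, x, 1, y, 1)
        enddo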

Also I might warn that at higher optimization levels the compiler could
transform your program by interchanging the 2 loops:

	do i=1,n
	   do j=1,100000
		y(i) = y(i) + a*x(i)
	   enddo
	enddo

and it could then eliminate the inner loop:

	do i=1,n
		y(i) = y(i) + 100000*a*x(i)
	enddo

and would produce a truly Monumental (and bogus :-) MFLOP value.  I.e.,
beware of simple benchmarks.
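
One commonly used defense, if you do want to time the loop as written, is to
hide the result behind a call the compiler cannot see through, so the two
loops cannot be collapsed (use_result here is a hypothetical stub compiled
separately; note it does add some call overhead per outer iteration):

        do j = 1,100000
           forall(i=1:n:1) y(i) = y(i) + a*x(i)
           call use_result(y, n)     ! external - defeats the interchange
        enddo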


*2 I could not find an indication of which compiler version was
used, nor the compiler switches.  I assume some level of optimization
was performed (maybe the default, -O4??), because there was 1 graph
with "No optimization".  Different levels of optimization can obviously
have different effects on the performance results.

Comments on Test1:
http://rubens.its.unimelb.edu.au/~dirk/perf/test1.html

* Loop Unrolling *
I assume that the compiler unrolls the daxpy loop 4 times

	do i=1,(n/4)*4,4
		y(i)=y(i)+ a*x(i)
		y(i+1)=y(i+1)+ a*x(i+1)
		y(i+2)=y(i+2)+ a*x(i+2)
		y(i+3)=y(i+3)+ a*x(i+3)
	end do

C       and here's the cleanup loop: 0-3 iterations
	do i=i,n
		y(i)=y(i)+ a*x(i)
	enddo

The unrolled loop runs faster per iteration than the second
loop.  That is why when n is an exact multiple of 4, the 
performance is better, since all the computations are performed
in the efficient loop.

When n is 4x+1, the cleanup loop is executed only once, and the conditional
branch at the end of the loop falls through.  This is the same direction
taken for the previous value of J, so branch prediction "guesses"
correctly.  However, for n = 4x+2, the conditional branch alternates
between "taken" and "not taken", which the branch predictor doesn't handle
as well (I think it may "guess" wrong at least half the time).  This
is why n = 4x+1 is close to the n = 4x case, and why n = 4x+2 is less.  
n = 4x+3 is the worst because more of the cleanup loop is used (and less of
the better unrolled loop).  I believe this explains:

	"The variation across fig 5 we have no explanation for. The low
	points seem to correspond to every second odd number"

"Every second odd number" corresponds to 4x+3.

* Constant Bounds Loop Unrolling *

The compiler typically picks 4 as the unroll amount (this can also be set
using the -unroll <n> switch).  However, if the number of iterations is
known at compile time, and it isn't a multiple of 4 but is a multiple of
small integers near 4, then the compiler will unroll by that amount and
won't bother generating the "cleanup" loop.  So a loop with constant bound
9, 15, 18, etc. will run faster than a loop with a "variable" bound of the
same value.

	Secondly, for a constant like 13, the cleanup loop is known to
execute only once, so those instructions can be scheduled among the code
following the loop, reducing its cost over the "variable" bound loop of
length 13.
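
As an illustration, for a constant bound of 9 the generated code would
correspond roughly to the following (the unroll factor of 3 is my assumption
of what the compiler might pick):

        ! constant bound of 9, unrolled by 3: no cleanup loop is needed
        do i=1,9,3
                y(i)  =y(i)  + a*x(i)
                y(i+1)=y(i+1)+ a*x(i+1)
                y(i+2)=y(i+2)+ a*x(i+2)
        end do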


* Impact of Array Length on Performance *

Now, why does the performance drop off when the array length is near 1024?
As hypothesized, it is related to the direct mapped dcache.

The dcache is 8kbytes, broken up into 256 cache lines of 32 contiguous
bytes.  A real*8 array of 1024 elements is exactly 8k bytes.  (I assume)
the two arrays are declared adjacent to each other (assume x comes before
y), so they EXACTLY overlap: x(i) and y(i) compete for the identical cache
position.  

Furthermore, since each cache line is 32 bytes long, which is 4
real*8 elements, x(i) overlaps with the cache line for 3 other elements of
y near y(i) (e.g., y(i-1),y(i+1),y(i+2); the exact elements depend on where
in the cache line x(i) maps).  Any time the program loads x(i), the cache
line with those 4 elements of y is bumped out of the dcache (back to the
on-chip scache [secondary cache]).

The amount of overlap depends both on the array size, AND the vector
length.  For example, if the array length is 1000, then x(1) overlaps with
y(25).  If the vector length is 10, then there is no "vector" overlap: all
references are satisfied "quickly" from the dcache.  However, if the vector
length is >= 25, then elements y(25)... will overlap and interfere with
x(1)... .  When this vector overlap occurs, then for each iteration of the
outer J loop, the overlapping cache lines will have to be read out of the
scache into the dcache for x and for y: so 2 cache misses for each
overlapping cache line.

This explains the "gentle" slope toward the deep trench: as there is more
vector overlap (due to array length closer to 1024, or longer vector length),
more time is spent doing cache misses so MFLOPs decreases.
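
A small self-contained sketch of that mapping (it assumes x and y are
adjacent real*8 arrays with x first, and that x(1) starts on a dcache line
boundary):

        program dcache_map
        implicit none
        integer, parameter :: line_bytes = 32, cache_bytes = 8*1024
        integer, parameter :: alen = 1000    ! array length from the example
        integer i, line_x1, line_yi
        ! byte offset of x(i) is (i-1)*8; y starts right after x, at alen*8
        line_x1 = 0                          ! x(1) assumed at offset 0 -> line 0
        do i = 1, alen
           line_yi = mod(alen*8 + (i-1)*8, cache_bytes) / line_bytes
           if (line_yi .eq. line_x1) then
              print *, 'y(', i, ') shares a dcache line with x(1)'
           endif
        end do
        end program dcache_map

For alen = 1000 this prints y(25) through y(28); change alen to 1024 and it
prints y(1) through y(4), i.e. x(i) and y(i) collide exactly.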

* What about the deep trench at array length = 1024? *

Your 3-D graph shows a sharp discontinuity - is something else going on?

I think I can explain it in the following way. First, I assume the
assembler code for the unrolled loop looks like:

	load x(i)
	load x(i+1)
	load x(i+2)
	load x(i+3)
	load y(i)
	load y(i+1)
	load y(i+2)
	load y(i+3)
	4 multiplies
	4 adds
	store y(i)
	store y(i+1)
	store y(i+2)
	store y(i+3)

Second, assume that x(1) (and y(1)) do not start exactly at the beginning
of a cache line.  Therefore, loading the 4 elements of x uses 2 cache lines
- the same 2 cache lines needed by the 4 elements of y.

        Assume we have loaded the 4 values of x (we'll address the cache
misses for x in a second).  In order to load the 4 y's, there will be 2
cache misses to wait for, and then the stores of y will hit in the dcache.
For the next iteration i' = i+4 , what happens with the cache lines needed
by x(i+4)..x(i+7)?  The cache line for x(i+4) overlaps with the cache line
for y(i+3) from the previous iteration i, so x(i+4) gets a cache miss.
Furthermore, on the last outer iteration of J, the cache line for x(i+7)
was displaced by the load and store of y(i+7) - so x(i+7) will also have a
cache miss.  This reasoning also applies for iteration i.  Therefore there
are 4 cache misses for each (unrolled) iteration, i.e., there are 4 misses
for every cache line.  This is serious cache thrashing!

It is the 4 misses versus the 2 misses (of the more "distant" overlap) that
explains the performance trench.

The width of the trench is ~7 because there are 7 values of array length
surrounding 1024 for which the x's and y's conflict on the same iteration
i.  This width depends on the unroll amount, the number of elements per
cache line, and the starting cache alignment of x (and y).
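
A quick way to see the seven (again assuming adjacent arrays with x first):
for array length L, y(i) sits L*8 bytes after x(i), so their dcache-index
offset is L*8 modulo the 8192-byte cache, and a same-iteration conflict
needs that offset to be within one 32-byte line of zero:

        integer alen
        do alen = 1020, 1028
           print *, alen, mod(alen*8 + 4096, 8192) - 4096  ! signed offset, bytes
        end do

The offset is within +/-24 bytes (less than one line) only for lengths 1021
through 1027 - seven values, matching the width of the trench.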


Note that Digital offers the KAP Optimizing Preprocessor product which
performs numerous program transformations in order to use the cache
efficiently.  One such transform is to "pad" arrays like x and y so that
their lengths are Not exact multiples of the cache size so they do not
conflict.
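
A hand-done version of that padding might look like the following (the
8-element pad is my choice; anything that moves y off the exact 8 Kbyte
separation works):

        real*8 x(1024), pad(8), y(1024)
        common /vecs/ x, pad, y
        ! the 64-byte pad shifts y by two dcache lines relative to x, so
        ! x(i) and y(i) no longer map to the same line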


Mark Davis
c/c++ team