The benchmark is:
      do j = 1, 100000
         forall (i=1:n:1) y(i) = y(i) + a*x(i)
      enddo
*1 General: this daxpy loop is a tiny benchmark. If the compiler generates
a different instruction, or a different instruction order, or the chip
responds slightly differently, so that each trip through the inner loop
varies by a cycle or two between versions, then you might see a 10%
difference and hypothesize a "significant difference". Most real programs
have more loops and larger loops, where a difference of a few cycles per
iteration doesn't show up as a large overall difference. I believe this
can explain much of the variation in test2 and beyond.
If DAXPY is actually all a real program does, then you might
consider using the DXML (Digital Extended Math Library) product,
which has zillions of standard routines (BLAS 1, 2, 3, etc.) that are
carefully optimized to be extremely efficient on Alpha.
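In fact, the inner loop above is exactly the BLAS 1 daxpy operation, so
the benchmark could be written as one library call per outer iteration
(a sketch; DXML supplies the standard BLAS interface):
      do j = 1, 100000
         call daxpy(n, a, x, 1, y, 1)
      enddo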
Also I might warn that at higher optimization levels the compiler could
transform your program by interchanging the 2 loops:
   do i = 1, n
      do j = 1, 100000
         y(i) = y(i) + a*x(i)
      enddo
   enddo
and it could then eliminate the inner loop:
   do i = 1, n
      y(i) = y(i) + 100000*a*x(i)
   enddo
and would produce a truly Monumental (and bogus :-) MFLOP value. I.e.,
beware of simple benchmarks.
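One common defense (just a sketch, and no guarantee against every
optimizer) is to vary the operands and use the results, so neither loop
can be collapsed or discarded:
      do j = 1, 100000
         a = a + 1.0d-9
c        varying "a" makes each pass of the J loop different, so the
c        inner loop cannot be collapsed into a single multiply
         forall (i=1:n:1) y(i) = y(i) + a*x(i)
      enddo
      print *, y(1), y(n)
c     printing the results keeps the computation from being eliminated
c     as dead code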
*2 I could not find an indication of which compiler version was
used, nor the compiler switches. I assume some level of optimization
was performed (maybe the default, -O4??), because there was 1 graph
with "No optimization". Different levels of optimization can obviously
have different effects on the performance results.
Comments on Test1:
http://rubens.its.unimelb.edu.au/~dirk/perf/test1.html
* Loop Unrolling *
I assume that the compiler unrolls the daxpy loop 4 times:
      do i = 1, (n/4)*4, 4
         y(i)   = y(i)   + a*x(i)
         y(i+1) = y(i+1) + a*x(i+1)
         y(i+2) = y(i+2) + a*x(i+2)
         y(i+3) = y(i+3) + a*x(i+3)
      end do
C and here's the cleanup loop: 0-3 iterations
      do i = i, n
         y(i) = y(i) + a*x(i)
      enddo
The unrolled loop runs faster per iteration than the second
loop. That is why when n is an exact multiple of 4, the
performance is better, since all the computations are performed
in the efficient loop.
When n is 4x+1, the cleanup loop executes only once, and the conditional
branch at the end of it falls through. This is the same direction taken
for the previous value of J, so branch prediction "guesses" correctly.
However, for n = 4x+2, the conditional branch alternates between "taken"
and "not taken", which the branch predictor doesn't handle as well (I
think it may "guess" wrong at least half the time). This is why n = 4x+1
is close to the n = 4x case, and why n = 4x+2 is lower. n = 4x+3 is the
worst because more of the work is done in the cleanup loop (and less in
the better unrolled loop). I believe this explains:
"The variation across fig 5 we have no explanation for. The low
points seem to correspond to every second odd number"
"Every second odd number" corresponds to 4x+3.
* Constant Bounds Loop Unrolling *
The compiler typically picks 4 as the unroll amount (this can also be set
using the -unroll <n> switch). However, if the number of iterations is
known at compile time, and it isn't a multiple of 4 but is a multiple of
small integers near 4, then the compiler will unroll by that amount and
won't bother generating the "cleanup" loop. So a loop with constant bound
9, 15, 18, etc. will run faster than a loop with a "variable" bound of the
same value.
Secondly, for a constant like 13, the cleanup loop is known to
execute only once, so those instructions can be scheduled among the code
following the loop, reducing its cost over the "variable" bound loop of
length 13.
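For example, with a constant bound of 9 the compiler can unroll by 3 and
emit no cleanup loop at all (a sketch of the transformed loop):
      do i = 1, 9, 3
         y(i)   = y(i)   + a*x(i)
         y(i+1) = y(i+1) + a*x(i+1)
         y(i+2) = y(i+2) + a*x(i+2)
      enddo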
* Impact of Array Length on Performance *
Now, why does the performance drop off when the array length is near 1024?
As hypothesized, it is related to the direct-mapped dcache.
The dcache is 8kbytes, broken up into 256 cache lines of 32 contiguous
bytes. A real*8 array of 1024 elements is exactly 8k bytes. (I assume)
the two arrays are declared adjacent to each other (assume x comes before
y), so they EXACTLY overlap: x(i) and y(i) compete for the identical cache
position.
Furthermore, since each cache line is 32 bytes long, which is 4
real*8 elements, x(i) overlaps with the cache line for 3 other elements of
y near y(i) (e.g., y(i-1), y(i+1), y(i+2); the exact elements depend on where
in the cache line x(i) maps). Any time the program loads x(i), the cache
line with those 4 elements of y is bumped out of the dcache (back to the
on-chip scache [secondary cache]).
The amount of overlap depends both on the array size, AND the vector
length. For example, if the array length is 1000, then x(1) overlaps with
y(25). If the vector length is 10, then there is no "vector" overlap: all
references are satisfied "quickly" from the dcache. However, if the vector
length is >= 25, then elements y(25), y(26), ... will overlap and
interfere with x(1), x(2), ... . When this vector overlap occurs, then on
each iteration of the outer J loop the overlapping cache lines have to be
read out of the scache into the dcache for x and for y: 2 cache misses for
each overlapping cache line.
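The arithmetic, assuming x starts at some base address and y is placed
immediately after it:
      addr(x(1))  = base
      addr(y(25)) = base + 8*1000 + 8*24 = base + 8192
Since the dcache index is just the address mod 8192 (8 Kbytes), x(1) and
y(25) map to the same cache line.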
This explains the "gentle" slope toward the deep trench: as there is more
vector overlap (due to an array length closer to 1024, or a longer vector
length), more time is spent on cache misses and the MFLOP rate drops.
* What about the deep trench at array length = 1024? *
Your 3-D graph shows a sharp discontinuity - is something else going on?
I think I can explain it in the following way. First, I assume the
assembler code for the unrolled loop looks like:
      load x(i)
      load x(i+1)
      load x(i+2)
      load x(i+3)
      load y(i)
      load y(i+1)
      load y(i+2)
      load y(i+3)
      4 multiplies
      4 adds
      store y(i)
      store y(i+1)
      store y(i+2)
      store y(i+3)
Second, assume that x(1) (and y(1)) do not start exactly at the beginning
of a cache line. Therefore, loading the 4 elements of x uses 2 cache lines
- the same 2 cache lines needed by the 4 elements of y.
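For example (one possible alignment - the exact offset is an assumption):
if x(i) begins 8 bytes past a line boundary, then x(i)..x(i+3) occupy
bytes 8..39 relative to that boundary, straddling two 32-byte lines; and
since the arrays sit exactly 8192 bytes apart, y(i)..y(i+3) straddle the
same two cache line positions.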
Assume we have loaded the 4 values of x (we'll get to the cache misses
for x in a moment). In order to load the 4 y's, there will be 2 cache
misses to wait for, and then the stores of y will hit in the dcache.
For the next iteration i' = i+4, what happens with the cache lines needed
by x(i+4)..x(i+7)? The cache line for x(i+4) overlaps with the cache line
for y(i+3) from the previous iteration i, so x(i+4) takes a cache miss.
Furthermore, during the previous pass of the outer J loop, the cache line
for x(i+7) was displaced by the load and store of y(i+7) - so x(i+7) will
also take a cache miss. The same reasoning applies to iteration i itself.
Therefore there are 4 cache misses for each (unrolled) iteration, i.e., 4
misses for every cache line's worth of data. This is serious cache
thrashing!
It is the 4 misses versus the 2 misses (of the more "distant" overlap) that
explains the performance trench.
The width of the trench is ~7 because there are 7 values of array length
surrounding 1024 for which the x's and y's conflict on the same iteration
i. This width depends on the unroll amount, the number of elements per
cache line, and the starting cache alignment of x (and y).
Note that Digital offers the KAP Optimizing Preprocessor product which
performs numerous program transformations in order to use the cache
efficiently. One such transform is to "pad" arrays like x and y so that
their lengths are not exact multiples of the cache size and therefore do
not conflict.
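Done by hand, the padding might look like this (a sketch; any pad that
shifts y off of x's cache lines will do):
      real*8 x(1024), pad(8), y(1024)
      common /vecs/ x, pad, y
c     the 64-byte pad offsets y by 2 cache lines from x, so x(i) and
c     y(i) no longer map to the same dcache line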
Mark Davis
c/c++ team