Title: | Digital Fortran |
Notice: | Read notes 1.* for important information |
Moderator: | QUARK::LIONEL |
Created: | Thu Jun 01 1995 |
Last Modified: | Fri Jun 06 1997 |
Last Successful Update: | Fri Jun 06 1997 |
Number of topics: | 1333 |
Total number of notes: | 6734 |
UNIX 4.0 564
f77 4.1 -92

A Fortran program doing Fourier transforms runs with different timings:

- on an AlphaServer 8400 5/440 (TurboLaser): 4 MB cache, 4 GB memory
- on a Rawhide, 4 x EV56 @ 400 MHz: 500 MB memory

On the Rawhide: nuwigres runs in 43' with 54% CPU, due to paging.
On the TL: nuwigres runs in 62' with 99% CPU ???

Same software, same output.

Why is the TurboLaser slower despite faster CPUs and full CPU utilization?

/Joseph
T.R | Title | User | Personal Name | Date | Lines |
---|---|---|---|---|---|
1195.1 | Rawhide has lower memory latency | WIDTH::MDAVIS | Mark Davis - compiler maniac | Tue Feb 25 1997 11:39 | 10 |
...but not 3 times lower!? [Were the 43' and 62' elapsed times => Rawhide 3x faster, or CPU times => Rawhide 33% faster?]

Was anything else running on the TL when you ran this?

I assume that since you need more than 500 MB of physical memory, memory speed is key, and the 10% faster CPU has little relevance to the overall speed...

I hope the FFT doesn't have power-of-2 cache thrashing problems.

BTW, what's the size of the Rawhide cache? (I assume it's also 4 MB.)
1195.2 | Is it always like this? | PERFOM::HENNING | Thu Feb 27 1997 05:48 | 24 | |
Rawhide/400 has a 4 MB cache.

Mark mentioned latency. Ditto bandwidth. If the job is only using a single CPU, Rawhide is better at delivering memory bandwidth than is TurboLaser... but not 3x better!

http://www.cs.virginia.edu/stream/standard/Bandwidth.html

Machine ID        ncpus   COPY    SCALE   ADD     TRIAD
DEC_8400_5-350    1       215.7   207.5   219.6   234.2
DEC_4100_5-400-   1       247.8   243.9   264.9   268.0

Of course, TL has more "headroom" as you add CPUs.

If the job is prone to power-of-two problems, then it will show great variability from run to run. It will sometimes seem to be stuck in a slow mode, and sometimes seem to run quite nicely. Rebooting the system just before running will either be the best thing you can do for it or the worst.

It might like having its pages randomized. Although expensive, letting the VM page reclaim routines run around and muck up your working set would be one way of randomizing. Having other jobs compete for the memory would also randomize. Conversely, having plenty of memory lying around idle would tend to let a bad mapping stay bad and never budge.
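For the array-stride variety of the power-of-two problem, one common workaround is to pad the leading dimension of an array so that successive columns are not a power-of-two number of bytes apart. The following is only a minimal sketch in free-form Fortran; the array name, the sizes, and the pad amount are hypothetical and are not taken from nuwigres. Padding changes virtual-address strides only; it does nothing about how the VM system places physical pages.

```fortran
program pad_demo
  implicit none
  integer, parameter :: n = 1024      ! logical size, a power of two
  integer, parameter :: pad = 8       ! hypothetical pad, to be tuned per machine
  integer, parameter :: ld = n + pad  ! padded leading dimension
  double precision, allocatable :: a(:,:)
  double precision :: s
  integer :: i, j

  allocate(a(ld, n))
  a = 1.0d0

  ! Walk along rows, so consecutive accesses are ld*8 bytes apart.
  ! With ld = 1024 the stride would be 8192 bytes, a power of two, and
  ! many accesses would compete for the same few cache locations.
  ! With ld = 1032 the accesses are spread out more evenly.
  s = 0.0d0
  do i = 1, n
     do j = 1, n
        s = s + a(i, j)
     end do
  end do

  print *, 'sum over the logical n x n part =', s
end program pad_demo
```

Only the first n rows of each column are used; the extra pad rows are wasted memory, which is the price of the changed stride.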
1195.4 | The divide is killing you | WIDTH::MDAVIS | Mark Davis - compiler maniac | Thu Feb 27 1997 16:23 | 93 |
I assume you're compiling with at least -fast, which permits the divide-by-loop-invariant to become

    compute inverse outside loop
    multiply by inverse inside loop

Therefore in your inner loop:

      DO IEP=1,NRPA
        EFINL=WPRPA(IEP,KK)
        EAVER=(EINIT+EFINL)/2.D0
        EDEN(IAB,ICD,IQ)=EDEN(IAB,ICD,IQ)+
     &  XLEFT(IIA,IIC,IEP,KK)*XRIGHT(IIB,IID,IE,KK)*OES(IEP,IE,KK)*
     &  FQQ/(ABSCIS(IQ)+EAVER)
      ENDDO

the divide by 2.0 becomes a multiply by .5, but the divide by (ABSCIS(IQ)+EAVER) is the dominant cost in the loop. The divide takes ~25 cycles. Compiling with

    f77 -fast -tune ev5

and different switches generates different numbers of cycles per iteration:

    -unroll 1            52
    <normal unroll>      33
    -O5 (pipelining)     25

The pipelining case shows that the optimizer does the best job possible: it has to do at least 1 divide per iteration, and since divides can't overlap, 25 cycles/iteration is the minimum.

HOWEVER, the function can be recoded to save LOTS of time. The expression "FQQ/(ABSCIS(IQ)+EAVER)" depends only on the inner 3 loop indices, and the arrays involved (WPRPA, ABSCIS, WRPA, WEIGHT) are not modified anywhere in the routine. Therefore it is possible to precompute this expression for all values of IQ, IE, IEP outside of the deep loop nest and store it into a 3-dimensional variable. The original routine is computing these values 20*20*20*20 TIMES more often than necessary.

The other thing to do is to use a local scalar variable, EDEN1, to accumulate the values into, instead of EDEN(IAB,ICD,IQ), since the indices for this array element don't change inside the inner 2 loops.
So at the beginning of the routine you add:

      dimension fq_ab_eav_inv(325, 325, 64)

      DO IQ=1,64
        QQ=ABSCIS(IQ)/XLAMB
        FQQ=1.D0/(1.D0+QQ*QQ)**2
        FQQ=FQQ*FQQ*ABSCIS(IQ)*WEIGHT(IQ)
        DO IE=1,NRPA
          EINIT=WRPA(IE,KK)
          DO IEP=1,NRPA
            EFINL=WPRPA(IEP,KK)
            EAVER=(EINIT+EFINL)/2.D0
            fq_ab_eav_inv(iep,ie,iq) = FQQ/(ABSCIS(IQ)+EAVER)
          enddo
        enddo
      enddo

and replace the inner 3 loops with:

      DO IQ=1,64
C
        EDEN1=0.D0
        DO IE=1,NRPA
          DO IEP=1,NRPA
            EDEN1=EDEN1+
     &      XLEFT(IIA,IIC,IEP,KK)*XRIGHT(IIB,IID,IE,KK)*OES(IEP,IE,KK)*
     &      fq_ab_eav_inv(iep,ie,iq)
          ENDDO
        ENDDO
        EDEN(IAB,ICD,IQ)=EDEN1*2.D0/(197.328*PI)
      ENDDO

Now the inner loop has 4 loads, 3 multiplies, and 1 add. The software pipeliner (-fast -O5) does its job and reduces the number of cycles per iteration to 6.5!

Unfortunately, this transformation will speed up the program on ANYONE's machine, even our competitors' (though our divide is slower in terms of number of cycles, because our cycle speed is so fast, so we'll see a larger gain).

Mark
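To check that a recoding of this kind gives the same answers, the two loop shapes can be compared side by side on small made-up data. The program below is only a sketch in free-form Fortran, not the nuwigres routine: the array names x and tab, the sizes, and the input values are all hypothetical, and the 4-dimensional XLEFT/XRIGHT/OES product is collapsed into a single stand-in array.

```fortran
program precompute_demo
  implicit none
  integer, parameter :: nq = 4, nrpa = 5, nouter = 3   ! hypothetical sizes
  double precision :: abscis(nq), fqq(nq), wrpa(nrpa), wprpa(nrpa)
  double precision :: x(nrpa, nrpa, nouter)   ! stand-in for XLEFT*XRIGHT*OES
  double precision :: eden_ref(nouter, nq), eden_new(nouter, nq)
  double precision :: tab(nrpa, nrpa, nq)     ! plays the role of fq_ab_eav_inv
  double precision :: eaver, acc
  integer :: io, iq, ie, iep

  ! Made-up input data, used only for the comparison below.
  do iq = 1, nq
     abscis(iq) = 0.5d0 * iq
     fqq(iq) = 1.0d0 / (1.0d0 + abscis(iq))
  end do
  do ie = 1, nrpa
     wrpa(ie) = 1.0d0 + 0.1d0 * ie
     wprpa(ie) = 2.0d0 + 0.2d0 * ie
  end do
  do io = 1, nouter
     do ie = 1, nrpa
        do iep = 1, nrpa
           x(iep, ie, io) = dble(iep + 2*ie + 3*io)
        end do
     end do
  end do

  ! Original shape: one divide per innermost iteration, array accumulator.
  eden_ref = 0.0d0
  do io = 1, nouter
     do iq = 1, nq
        do ie = 1, nrpa
           do iep = 1, nrpa
              eaver = (wrpa(ie) + wprpa(iep)) / 2.0d0
              eden_ref(io, iq) = eden_ref(io, iq) + &
                   x(iep, ie, io) * fqq(iq) / (abscis(iq) + eaver)
           end do
        end do
     end do
  end do

  ! Recoded shape, step 1: fill the table once, outside the outer loop.
  do iq = 1, nq
     do ie = 1, nrpa
        do iep = 1, nrpa
           eaver = (wrpa(ie) + wprpa(iep)) / 2.0d0
           tab(iep, ie, iq) = fqq(iq) / (abscis(iq) + eaver)
        end do
     end do
  end do

  ! Step 2: the hot loop has no divide and accumulates into a scalar.
  do io = 1, nouter
     do iq = 1, nq
        acc = 0.0d0
        do ie = 1, nrpa
           do iep = 1, nrpa
              acc = acc + x(iep, ie, io) * tab(iep, ie, iq)
           end do
        end do
        eden_new(io, iq) = acc
     end do
  end do

  print *, 'max abs difference:', maxval(abs(eden_new - eden_ref))
end program precompute_demo
```

The two versions group the multiply and divide differently, so the results should agree to rounding error and are not guaranteed to match bit for bit.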
1195.5 | re: .4: slightly safer transformation | WIDTH::MDAVIS | Mark Davis - compiler maniac | Thu Feb 27 1997 16:40 | 17 |
I moved the precomputation of the inverse expression outside of all the loops. However, there's a bunch of conditional code, so it IS possible that the inner loops will never be executed (though I assume this case is unlikely).

To avoid doing the precomp unnecessarily, you can instead move the precomp loop inside the fourth loop, after all the tests for whether to skip the inner loops, and then put:

      if (initialized .eq. 0) then
         <precomp loop>
         initialized = 1
      endif

and of course put "initialized = 0" at the beginning of the routine...
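The flag pattern above can be tried out on a toy routine. This is only a sketch in free-form Fortran: the routine work, the argument run_inner (which stands in for the routine's real skip tests), the counter nfill, and the table contents are all hypothetical.

```fortran
program lazy_precomp_demo
  implicit none
  integer :: nfill
  double precision :: total

  call work(.false., nfill, total)
  print *, 'inner loops skipped: table fills =', nfill
  call work(.true., nfill, total)
  print *, 'inner loops run:     table fills =', nfill

contains

  subroutine work(run_inner, nfill, total)
    logical, intent(in) :: run_inner        ! stands in for the skip tests
    integer, intent(out) :: nfill           ! how many times the table was built
    double precision, intent(out) :: total
    integer, parameter :: n = 4
    double precision :: tab(n)
    integer :: initialized, iouter, i

    initialized = 0                 ! reset on every entry to the routine
    nfill = 0
    total = 0.0d0

    do iouter = 1, 3
       if (.not. run_inner) cycle   ! tests that may skip the inner loops

       if (initialized .eq. 0) then ! build the table on first real use
          do i = 1, n
             tab(i) = 1.0d0 / dble(i + 1)
          end do
          nfill = nfill + 1
          initialized = 1
       end if

       do i = 1, n                  ! inner loop that uses the table
          total = total + tab(i)
       end do
    end do
  end subroutine work

end program lazy_precomp_demo
```

The first call never reaches the inner loop, so the table is never built; the second call builds it once and reuses it on the later outer iterations. The flag is set by an ordinary assignment on entry instead of an initializer in the declaration, because in Fortran 90 and later an initializer would make the variable saved across calls.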