Conference turris::fortran

Title:Digital Fortran
Notice:Read notes 1.* for important information
Moderator:QUARK::LIONEL
Created:Thu Jun 01 1995
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:1333
Total number of notes:6734

1195.0. "performance issue" by RTOMS::PARETIJ () Tue Feb 25 1997 10:23

UNIX 4.0 564 f77 4.1 -92

A Fortran program doing Fourier transforms shows very different run times on two machines:

- an AlphaServer 8400 5/440 (the TurboLaser, "tl" below) ; 4 MB cache ; 4 GB memory
- a Rawhide, 4 x EV56 @ 400 MHz ; 500 MB memory

on the rawhide : nuwigres runs in 43' with 54% cpu due to paging
on the tl      : nuwigres runs in 62' with 99% cpu ??? same software, same output


Why is the TurboLaser slower despite faster CPUs and full CPU utilization?

/Joseph
T.R  Title  User  Personal Name  Date  Lines
1195.1  Rawhide has lower memory latency  WIDTH::MDAVIS  Mark Davis - compiler maniac  Tue Feb 25 1997 11:39  10
but not 3 times lower!?  [Were the 43' and 62' elapsed times => rawhide 3x faster,
or cpu times => rawhide 33% faster?]
Was anything else running on the TL when you
ran this?  I assume that since you need more than 500 MB of physical
memory, memory speed is key, and the 10% faster cpu has little relevance
to the overall speed....

I hope the FFT doesn't have power-of-2 cache thrashing problems.

BTW, what's the size of the rawhide cache (I assume it's also 4meg)?
1195.2  Is it always like this?  PERFOM::HENNING  Thu Feb 27 1997 05:48  24
    Rawhide/400 has a 4 MB cache.
    
    Mark mentioned latency.  Ditto bandwidth.  If the job is only using a
    single CPU, Rawhide is better at delivering memory bandwidth than is
    Turbolaser...but not 3x better!
    
    http://www.cs.virginia.edu/stream/standard/Bandwidth.html
    
    Machine ID                ncpus    COPY    SCALE      ADD    TRIAD   (MB/s)
    DEC_8400_5-350               1    215.7    207.5    219.6    234.2
    DEC_4100_5-400-              1    247.8    243.9    264.9    268.0
    
    Of course, TL has more "headroom" as you add CPUs.
    
    If the job is prone to power-of-two problems, then it will show great
    variability from run-to-run.  It will tend to sometimes seem to be
    stuck in a slow mode, and sometimes seem to run quite nicely. 
    Rebooting the system just before running will either be the best thing
    you can do for it or the worst.  It might like having its pages
    randomized - and although expensive, letting the vm page reclaim
    routines run around and muck up your working set would be one way of
    randomizing.  Having other jobs compete for the memory would randomize. 
    Conversely, having plenty of memory lying around idle would tend to let
    a bad mapping stay bad and never budge.
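
To make the power-of-two effect concrete, here is a minimal sketch that counts how many distinct cache lines a strided access pattern can use. The cache model is an assumption, not something stated in this thread: direct mapped, 4 MB, 64-byte lines, indexed by physically contiguous addresses starting at 0. The function name NLINES and its arguments are hypothetical.

```fortran
      INTEGER FUNCTION NLINES(ISTRID, NACC)
!     Count the distinct cache lines touched by NACC accesses that
!     are ISTRID bytes apart, starting at address 0.
!     Assumed model: direct-mapped 4 MB cache with 64-byte lines.
!     NACC*ISTRID must stay below 2**31 to avoid integer overflow.
      IMPLICIT NONE
      INTEGER ISTRID, NACC
      INTEGER LINESZ, NSETS
      PARAMETER (LINESZ = 64)
      PARAMETER (NSETS = 4*1024*1024/LINESZ)
      LOGICAL USED(0:NSETS-1)
      INTEGER I, IDX
      DO I = 0, NSETS-1
         USED(I) = .FALSE.
      ENDDO
      NLINES = 0
      DO I = 0, NACC-1
!        Line index = (address / line size) modulo number of lines
         IDX = MOD((I*ISTRID)/LINESZ, NSETS)
         IF (.NOT. USED(IDX)) THEN
            USED(IDX) = .TRUE.
            NLINES = NLINES + 1
         ENDIF
      ENDDO
      END
```

Under these assumptions, 64 accesses at a stride of 524288 bytes (65536 double-precision elements) share only 8 cache lines, while padding the stride by one element to 524296 bytes spreads the same 64 accesses over 64 different lines. On a real system the physical page placement decides which of the two cases you get, which is why the behaviour can change from run to run.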
1195.4  The divide is killing you  WIDTH::MDAVIS  Mark Davis - compiler maniac  Thu Feb 27 1997 16:23  93
I assume you're compiling at least -fast, which permits the 
divide-by-loop-invariant to become
	compute inverse outside loop
	multiply by inverse inside loop.

Therefore in your inner loop:

      DO IEP=1,NRPA
         EFINL=WPRPA(IEP,KK)
         EAVER=(EINIT+EFINL)/2.D0
         EDEN(IAB,ICD,IQ)=EDEN(IAB,ICD,IQ)+
     & XLEFT(IIA,IIC,IEP,KK)*XRIGHT(IIB,IID,IE,KK)*OES(IEP,IE,KK)*
     &                   FQQ/(ABSCIS(IQ)+EAVER)
      ENDDO

The divide by 2.D0 becomes a multiply by 0.5, but the divide by (ABSCIS(IQ)+EAVER),
which changes on every iteration, is the dominant cost in the loop.  The divide takes ~25 cycles.
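
For readers who have not seen the divide-by-loop-invariant transformation written out, here is a minimal sketch of what the compiler does with a constant divisor such as 2.D0. The routine names DIVLP and MULLP are made up for illustration and are not part of the program discussed here.

```fortran
      SUBROUTINE DIVLP(N, X, D, Y)
!     Straightforward form: one divide per iteration.
      IMPLICIT NONE
      INTEGER N, I
      DOUBLE PRECISION X(N), Y(N), D
      DO I = 1, N
         Y(I) = X(I)/D
      ENDDO
      END

      SUBROUTINE MULLP(N, X, D, Y)
!     Hand-applied transformation: the divisor D is loop invariant,
!     so compute its inverse once and multiply inside the loop.
      IMPLICIT NONE
      INTEGER N, I
      DOUBLE PRECISION X(N), Y(N), D, DINV
      DINV = 1.D0/D
      DO I = 1, N
         Y(I) = X(I)*DINV
      ENDDO
      END
```

The two forms can differ in the last bit when D is not a power of two, since 1/D is rounded before the multiply. The trick does not apply to a divisor that changes with the loop index, which is the situation with (ABSCIS(IQ)+EAVER) above.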

Compiling with f77 -fast -tune ev5 and different switches generates
different numbers of cycles per iteration:

	-unroll 1		52	
	<normal unroll>		33
	-O5 (pipelining)	25

The pipelining case shows that the optimizer does the best job possible:
it has to do at least 1 divide per iteration, and since divides can't
overlap, then 25 cycles/iteration is the minimum.


HOWEVER, the function can be recoded to save LOTS of time.  The expression
	"FQQ/(ABSCIS(IQ)+EAVER)" depends only on the inner 3 loop
indices (IQ, IE, IEP), and the arrays involved (WPRPA, ABSCIS, WRPA, WEIGHT) are not
modified anywhere in the routine.  Therefore it is possible to
precompute this expression for all values of IQ, IE, IEP outside of the deep
loop nest and store it in a 3-dimensional array.  The original
routine computes these values 20*20*20*20 times more often than
necessary.
	The other thing to do is to accumulate into a local scalar variable, EDEN1,
instead of into EDEN(IAB,ICD,IQ), since
the indices of this array element don't change inside the inner 2 loops.

So at the beginning of the routine you add:

      dimension fq_ab_eav_inv(325, 325, 64)
      DO IQ=1,64
         QQ=ABSCIS(IQ)/XLAMB
         FQQ=1.D0/(1.D0+QQ*QQ)**2
         FQQ=FQQ*FQQ*ABSCIS(IQ)*WEIGHT(IQ)
         DO IE=1,NRPA
            EINIT=WRPA(IE,KK)
            DO IEP=1,NRPA
               EFINL=WPRPA(IEP,KK)
               EAVER=(EINIT+EFINL)/2.D0
               fq_ab_eav_inv(iep,ie,iq) = FQQ/(ABSCIS(IQ)+EAVER)
            enddo
         enddo
      enddo


and replace the inner 3 loops with:
      DO IQ=1,64
         EDEN1=0.D0
         DO IE=1,NRPA
            DO IEP=1,NRPA
               EDEN1=EDEN1+
     & XLEFT(IIA,IIC,IEP,KK)*XRIGHT(IIB,IID,IE,KK)*OES(IEP,IE,KK)*
     &                   fq_ab_eav_inv(iep,ie,iq)
            ENDDO
         ENDDO
         EDEN(IAB,ICD,IQ)=EDEN1*2.D0/(197.328*PI)
      ENDDO


Now the inner loop has 4 loads, 3 multiplies, and 1 add.  The software
pipeliner (-fast -O5) does its job and reduces the number of cycles
per iteration to 6.5 !
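
The recoding can be tried in isolation on a toy problem. The sketch below is not the real routine: the names DIRECT and TABLED, the array C standing in for the XLEFT*XRIGHT*OES product, and the single outer loop IO standing in for the outer loop nest are all assumptions made for illustration. The work array T plays the role of fq_ab_eav_inv.

```fortran
      SUBROUTINE DIRECT(N, NQ, NOUT, A, F, W, WP, C, S)
!     Reference form: the divide sits in the innermost loop and is
!     repeated for every one of the NOUT outer iterations.
      IMPLICIT NONE
      INTEGER N, NQ, NOUT, IO, IQ, IE, IEP
      DOUBLE PRECISION A(NQ), F(NQ), W(N), WP(N), C(N,N,NOUT)
      DOUBLE PRECISION S(NQ,NOUT), EAVER, TERM
      DO IO = 1, NOUT
         DO IQ = 1, NQ
            S(IQ,IO) = 0.D0
            DO IE = 1, N
               DO IEP = 1, N
                  EAVER = (W(IE) + WP(IEP))/2.D0
                  TERM = F(IQ)/(A(IQ) + EAVER)
                  S(IQ,IO) = S(IQ,IO) + C(IEP,IE,IO)*TERM
               ENDDO
            ENDDO
         ENDDO
      ENDDO
      END

      SUBROUTINE TABLED(N, NQ, NOUT, A, F, W, WP, C, S, T)
!     Recoded form: fill the table T once, then reuse it in every
!     outer iteration and accumulate into a local scalar.
      IMPLICIT NONE
      INTEGER N, NQ, NOUT, IO, IQ, IE, IEP
      DOUBLE PRECISION A(NQ), F(NQ), W(N), WP(N), C(N,N,NOUT)
      DOUBLE PRECISION S(NQ,NOUT), T(N,N,NQ), EAVER, ACC
      DO IQ = 1, NQ
         DO IE = 1, N
            DO IEP = 1, N
               EAVER = (W(IE) + WP(IEP))/2.D0
               T(IEP,IE,IQ) = F(IQ)/(A(IQ) + EAVER)
            ENDDO
         ENDDO
      ENDDO
      DO IO = 1, NOUT
         DO IQ = 1, NQ
            ACC = 0.D0
            DO IE = 1, N
               DO IEP = 1, N
                  ACC = ACC + C(IEP,IE,IO)*T(IEP,IE,IQ)
               ENDDO
            ENDDO
            S(IQ,IO) = ACC
         ENDDO
      ENDDO
      END
```

DIRECT performs NOUT*NQ*N*N divides while TABLED performs NQ*N*N, at the price of the extra storage for T. Calling both with the same inputs and comparing S is a cheap way to check a recoding like this before timing it.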


Unfortunately, this transformation will speed up the program on ANYONE's
machine, even our competitors' (though our divide is slower in terms of
number of cycles, because our cycle speed is so fast - so we'll see a
larger gain).

Mark
1195.5  re: .4: slightly safer transformation  WIDTH::MDAVIS  Mark Davis - compiler maniac  Thu Feb 27 1997 16:40  17
I moved the precomputation of the inverse expression outside
of all the loops.  However, there's a bunch of conditional
code so it IS possible that the inner loops will never
be executed (though I assume this case is unlikely).

To avoid doing the precomp unnecessarily, you can instead move
the precomp loop inside the fourth loop after all the tests
for whether to skip the inner loops, and then put:

	if (initialized .eq. 0 ) then
		<precomp loop>
		initialized = 1
	endif


and of course put "initialized = 0" at the beginning of the 
routine...
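
A minimal sketch of that guard, on a made-up routine: LAZY, SKIP, TAB and NFILL are hypothetical names, and the one-line reciprocal table stands in for the real precomputation loop. NFILL only exists so the caller can see how often the table was filled.

```fortran
      SUBROUTINE LAZY(N, NOUT, SKIP, W, TAB, TOTAL, NFILL)
!     Each outer iteration may be skipped.  The table TAB is filled
!     only when the first iteration that needs it is reached.
      IMPLICIT NONE
      INTEGER N, NOUT, NFILL, INIT, IO, I
      LOGICAL SKIP(NOUT)
      DOUBLE PRECISION W(N), TAB(N), TOTAL
      INIT = 0
      NFILL = 0
      TOTAL = 0.D0
      DO IO = 1, NOUT
         IF (.NOT. SKIP(IO)) THEN
            IF (INIT .EQ. 0) THEN
!              Precomputation loop, executed at most once per call
               DO I = 1, N
                  TAB(I) = 1.D0/W(I)
               ENDDO
               INIT = 1
               NFILL = NFILL + 1
            ENDIF
            DO I = 1, N
               TOTAL = TOTAL + TAB(I)
            ENDDO
         ENDIF
      ENDDO
      END
```

If every iteration is skipped the table is never built (NFILL stays 0); otherwise it is built exactly once per call, however many iterations use it. Because the flag is reset at the top of the routine, no SAVE is needed.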