[Search for users] [Overall Top Noters] [List of all Conferences] [Download this site]

Conference turris::fortran

Title:Digital Fortran
Notice:Read notes 1.* for important information
Moderator:QUARK::LIONEL
Created:Thu Jun 01 1995
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:1333
Total number of notes:6734

1165.0. "critical loop" by RTOMS::PARETIJ () Fri Feb 07 1997 05:19

In a Max-planck program (Germany ) most of the work is done in the following
loops

  937          integer*4       i, j, k, l, m, n
   938  c
   939          do l = 1, nwlexc
   940              m = dataset(2,l)
   941              do j = 1, nwldet(l)
   942                  do k = 1, ntimepnt(m)
   943                      yfit(k,j,l) = 0.0
   944                      do n = 1, nsub
   945                          do i = 1, ntau
   946                              yfit(k,j,l) = yfit(k,j,l)
   947          1                       + das(j,i,l,n) * exp(-tim(k,m)/tau(i,n))
   948                          end do
   949                      end do
   950                  end do
   951              end do
   952          end do
   953  c

I suspect that the compiler doesn't do the best on it. Any idea ?
Thnaks
/Joseph
T.RTitleUserPersonal
Name
DateLines
1165.1Need more infoWIBBIN::NOYCEPulling weeds, pickin' stonesFri Feb 07 1997 10:048
Can you give us some declarations?  For example, are tim(*,*) and
tau(*,*) arrays or functions?  What datatype do they use?  What are
the expected values of nsub, ntau, ntimept(*), nwldet(*)?  

At first glance, it seems you would do better by moving the j loop to
be innermost, so that exp(...k,m,i,n) can be moved outside the loop.
This looks like it may have more payoff (~100 cycles per call) than
reordering the loops for optimal memory behavior.
1165.2version?TLE::C_STOCKSCheryl StocksFri Feb 07 1997 10:305
What operating system and version are you running this program on?  On Unix,
several versions back, there was a severe performance problem when doing
exp on some (very large negative?) values.

				cheryl
1165.3TLE::EKLUNDAlways smiling on the inside!Fri Feb 07 1997 10:3813
    	It would be useful to have the complete routine, and
    even better would be a runnable program (hopefully small).
    While Bill did not mention this, we are assuming that you
    compile with things like -O5 and -fast (/opt=lev=5/fast).
    Reordering the loops is probably the big winner.  This
    may cause your results to be slightly different, depending
    upon the sensitivity of the computation (you are computing
    sums of values - interchanging the loops will change the
    order of the summation which my affect the final result).
    
    Cheers!
    Dave Eklund
    
1165.4thanks & more questionsRTOMS::PARETIJMon Feb 10 1997 03:55149
>Can you give us some declarations?  For example, are tim(*,*) and
>tau(*,*) arrays or functions?  What datatype do they use?  What are
>the expected values of nsub, ntau, ntimept(*), nwldet(*)?  

tim and tau are real*4 arrays
yfit is a real array (because I use -r8 as a comp. option it will be real*8)


>At first glance, it seems you would do better by moving the j loop to
>be innermost, so that exp(...k,m,i,n) can be moved outside the loop.
>This looks like it may have more payoff (~100 cycles per call) than
>reordering the loops for optimal memory behavior.

GREAT ! the elapsed time for the run drops from 54" to 34" !!!

>What operating system and version are you running this program on?  On Unix,
>several versions back, there was a severe performance problem when doing
>exp on some (very large negative?) values.

digital unix 4.0 564  ; f77 v4.1-92
I use f77 -O5 -fast -tune host -speculate_all -r8

I also did manual blocking on Bill's code, by introducing additional
loops to have the l,i,n,j,k loops varying on a range sized nb, and I found 
the best result (about 28" ) using nb=20 and compiling with
f77 -O4 -fast -tune host  -r8
(Here -O5 slows it down to one minute, perhaps because it interferes
with the manual unrolling ?? )
By the way I was kind of surprised that the min. lies on nb=20;
it is an nb**5 domain (l, k, j, i, n) there are 2 doubles (crap and yfit )
and 4 real*4 (tau, tim, das ). To fit D-cache it should go like this :
      8000 = (8+4+4+4+8)* (nb**5) 
which should give nb=3 ???

Here follows the modified subroutine (i.e. Bill's idea+blocking )

Thanks to all of you, in particular Bill for the grand idea!
-Joseph

	subroutine calc_decay
c
c include global values
c
	include	'kfdimdef.inc'
	include	'kfgbldef.inc'
c
c define variables
c
c        real*8 crap(DIM_TAU,DIM_SUB,DIM_TIMEPNT)
         real*8 crap
	integer*4	i, j, k, l, m, n
c
      nb =20 
      do l = 1, nwlexc
            m = dataset(2,l)
       do k=1,ntimepnt(m)
          do j=1,nwldet(l)
                  yfit(k,j,l) = 0.0
          end do
       end do
      end do 
c
c       
      do ll=1,nwlexc,nb
       do l=ll,min(nwlexc,ll+nb-1)
            m = dataset(2,l)
         do kk=1, ntimepnt(m),nb
          do nn=1,nsub,nb
           do ii=1,ntau,nb 
            do jj=1,nwldet(l),nb
             do k=kk,min( ntimepnt(m),kk+nb-1)
               do n=nn,min(nsub,nn+nb-1)
                do i=ii,min(ntau,ii+nb-1)
                        crap=exp(-tim(k,m)/tau(i,n))
                  do j = jj,min(nwldet(l),jj+nb-1)
			    yfit(k,j,l) = yfit(k,j,l)
     1                 + das(j,i,l,n) * crap 
                  end do                               ! end do j
                end do                                 ! end do i
               end do                                  ! end do n
             end do                                    ! end do k
            end do                                     !end do jj
           end do                                        ! end do ii
          end do                                   ! end do nn
         end do                                   ! end do kk
       end do                                   ! end do l
      end do                                       ! end do ll
c
	return
	end

c
c  Filename:		kfgbldef.inc
c  Author:		I. Martin
c  Date:		V1.0	01-dec-1995
c  Description:		global values
c
c -----
c
c define variables in common blocks
c
	integer*4	ierr, lenlis
	character*80	fillis, fillst
	integer*4	mode, nabs, nflu, dataset(DIM_SASTYP,DIM_WLEXC),
	1		nwlexc, nwldet(DIM_WLEXC), mwldet(DIM_SASTYP), ntau,
	1		nsub, nsas(DIM_TAU,DIM_SASTYP),
	1		ngf(DIM_TAU,DIM_SASTYP), ntimepnt(DIM_SASTYP)
	real*4		wlexc(DIM_WLEXC),
	1		wldet(DIM_WLDET*DIM_WLEXC,DIM_SASTYP)
	integer*4	inddas(DIM_WLDET,DIM_WLEXC,DIM_SASTYP)
	real		amplinpsum(DIM_WLDET,DIM_WLEXC)
	logical*4	exccalc
	real*4		excfacdec(DIM_WLDET,DIM_WLEXC),
	1		prob(DIM_TAU), problb(DIM_TAU), probub(DIM_TAU)
	integer*4	probfit(DIM_TAU)
	real*4		ratmat(DIM_TAU,DIM_TAU,DIM_SUB),
	1		ratlb(DIM_TAU,DIM_TAU), ratub(DIM_TAU,DIM_TAU)
	integer*4	ratfit(DIM_TAU,DIM_TAU)
	real*4		excfac(DIM_WLEXC), excrel(DIM_TAU,DIM_WLEXC),
	1		excmat(DIM_TAU,DIM_WLEXC,DIM_SUB),
	1		exclb(DIM_WLEXC), excub(DIM_WLEXC)
	integer*4	excfit(DIM_WLEXC)
	integer*4	sasmode
	real*4		gfamp(DIM_GF,DIM_TAU,DIM_SASTYP),
	1		gfpos(DIM_GF,DIM_TAU,DIM_SASTYP),
	1		gfhw(DIM_GF,DIM_TAU,DIM_SASTYP),
	1		gfint(DIM_TAU,DIM_SASTYP),	! intercept and
	1		gfsl(DIM_TAU,DIM_SASTYP),	! slope for subground
	1		gfamplb(DIM_GF,DIM_TAU,DIM_SASTYP),
	1		gfposlb(DIM_GF,DIM_TAU,DIM_SASTYP),
	1		gfhwlb(DIM_GF,DIM_TAU,DIM_SASTYP),
	1		gfintlb(DIM_TAU,DIM_SASTYP),
	1		gfsllb(DIM_TAU,DIM_SASTYP),
	1		gfampub(DIM_GF,DIM_TAU,DIM_SASTYP),
	1		gfposub(DIM_GF,DIM_TAU,DIM_SASTYP),
	1		gfhwub(DIM_GF,DIM_TAU,DIM_SASTYP),
	1		gfintub(DIM_TAU,DIM_SASTYP),
	1		gfslub(DIM_TAU,DIM_SASTYP)
	integer*4	gfampfit(DIM_GF,DIM_TAU,DIM_SASTYP),
	1		gfposfit(DIM_GF,DIM_TAU,DIM_SASTYP),
	1		gfhwfit(DIM_GF,DIM_TAU,DIM_SASTYP),
	1		gfintfit(DIM_TAU,DIM_SASTYP),
	1		gfslfit(DIM_TAU,DIM_SASTYP)
	real*4		gfbeg(DIM_GF,DIM_TAU,DIM_SASTYP),
	1		gfend(DIM_GF,DIM_TAU,DIM_SASTYP),
	1		gfbegl(DIM_TAU,DIM_SASTYP), gfendl(DIM_TAU,DIM_SASTYP),
	1		sas(DIM_WLDET*DIM_WLEXC,DIM_TAU,DIM_SASTYP),
	1		saslb(DIM_TAU,DIM_SASTYP), sasub(DIM_TAU,DIM_SASTYP)
	integer*4	sasfit(DIM_TAU,DIM_SASTYP), g
1165.5best result with SGI codeRTOMS::PARETIJMon Feb 10 1997 05:0557
I had the best result  using the following silicon-graphics 
code, compiled with -O5 -fast -tune host -speculate-all -r8

It is about 20 "
------------------------------------------------
	subroutine calc_decay
c
c include global values
c
	include	'kfdimdef.inc'
	include	'kfgbldef.inc'
c
c define variables
c
	integer*4	i, j, k, l, m, n
c
        real*4 tau_inv(DIM_SUB,DIM_TAU)
        real*4 exp_vec(DIM_SASTYP,DIM_TIMEPNT,DIM_SUB,DIM_TAU)
        real*4 exp_arg(DIM_TIMEPNT,DIM_SUB,DIM_TAU)
        do n = 1, nsub
           do i = 1, ntau
              tau_inv(n,i) = 1.0/tau(i,n)
           enddo
        enddo
        do l = 1, nwlexc
            m = dataset(2,l)
            do k = 1, ntimepnt(m)
                do n = 1, nsub
                    do i = 1, ntau
                        exp_vec(m,k,n,i) = exp(-tim(k,m)*tau_inv(n,i))
                    end do
                end do
            end do
        end do
	do l = 1, nwlexc
	    m = dataset(2,l)
	    do j = 1, nwldet(l)
		do k = 1, ntimepnt(m)
		    yfit(k,j,l) = 0.0
Cum		    do n = 1, nsub
Cum			do i = 1, ntau
Cum			    yfit(k,j,l) = yfit(k,j,l)
Cum	1			+ das(j,i,l,n) * exp(-tim(k,m)/tau(i,n))
Cum			end do
Cum		    end do
		    do n = 1, nsub
			do i = 1, ntau
			    yfit(k,j,l) = yfit(k,j,l)
	1			+ das(j,i,l,n) * exp_vec(m,k,n,i)
			end do
		    end do
		end do
	    end do
	end do
c
	return
	end
1165.6TLE::EKLUNDAlways smiling on the inside!Mon Feb 10 1997 09:4113
    	I didn't see the declaration for yfit.  Since many of
    the declarations are real*4, they will NOT be affected by
    your -r8 switch.  Only those variables declared as real
    (or implicitly declared by starting letter) are affected
    by the -r8 switch.  You might want to check what is
    intended and whether you can get away with using real*4
    for the yfit array as well, which would result in less
    memory traffic.  This could help, especially if the
    array is large.
    
    Cheers!
    Dave Eklund
    
1165.7HPCGRP::MANLEYMon Feb 10 1997 10:1258
Joseph,

I'd be curious to know how the code below performs. It does essentially the
same thing your best code does, but uses less memory and accesses temporary
data arrays contiguously.

Also, if there's any way you can redefine "das" so that memory is accessed
contiguously, [ make das(j,i,l,n) into das(i,n,j,l) ] do it. It can only
help.

	- Dwight -

--------------------------------------------------------------------------------

	subroutine calc_decay
c
c include global values
c
	include	'kfdimdef.inc'
	include	'kfgbldef.inc'
c
c define variables
c
	integer*4	i, j, k, l, m, n
c
        real*4 tau_inv(DIM_TAU,DIM_SUB)
        real*4 exp_vec(DIM_TAU,DIM_SUB)
        real*4 temp
c
        do n = 1, nsub
           do i = 1, ntau
              tau_inv(i,n) = 1.0/tau(i,n)
           enddo
        enddo
        do l = 1, nwlexc
            m = dataset(2,l)
            do k = 1, ntimepnt(m)
               do n = 1, nsub
                  do i = 1, ntau
                     exp_vec(i,n) = exp(-tim(k,m)*tau_inv(i,n))
                  end do
               end do
	       do j = 1, nwldet(l)
		    temp = 0.0
		    do n = 1, nsub
			do i = 1, ntau
			    temp = temp + das(j,i,l,n) * exp_vec(i,n)
			end do
		    end do
		    yfit(k,j,l) = temp
		end do
	    end do
	end do
c
	return
	end

1165.8re. 1165.7RTOMS::PARETIJMon Feb 10 1997 10:354
Dwight,
with your code the elapsed time is ~22 " (just slightly more than 
using the silicon graphics code in 1165.5
/Joseph
1165.9HPCGRP::MANLEYMon Feb 10 1997 11:1611
Joseph,

> with your code the elapsed time is ~22 " (just slightly more than 
> using the silicon graphics code in 1165.5

Perhaps temp and exp_vec should be declared as real instead of real*4.
Might as well eliminate conversions.

	- Dwight -

1165.10re .9RTOMS::PARETIJMon Feb 10 1997 11:474
with this additional change the elapsed time is the same as on the sgi code
thanks Dwight

/Joseph
1165.11HPCGRP::MANLEYMon Feb 10 1997 13:017
> with this additional change the elapsed time is the same as on the sgi code

This seems strange, the memory access pattern should be much better than the
SGI version. At any rate, you need to change the declaration of and references
to "das" throughout the program to get further improvement.