A lot depends on what's in the "..." part of your loop:
1. Usually there's a divide by the square root of DR. You want to compile
with -fast so that this can become a reciprocal square root.
2. Usually X(*,I) and X(*,J) get modified to reflect the force
each exerts on the other, depending on the distance between them.
These stores look messy to the compiler because it doesn't know that
I .ne. J, so it has to assume that writing X(*,J) may change X(*,I).
You might copy X(*,I) into a temp array XI(*) before the loop,
reference it instead of X(*,I) inside the loop, and copy it back
into X(*,I) after the loop.
3. I would expect DIM to be only 3, so you could unroll the DO 110 loop
by hand. If you don't, the compiler will guess that it can unroll it
at least 4 times, and maybe software-pipeline it, but that effort is
wasted if DIM is only 3.
4. How much memory gets swept over by X(*,J) in these 3 nested loops?
If it is under about 90k, everything will probably hit in at least
the secondary cache (scache). If the range is larger, several megabytes,
then each reference to X(1,J) may stall for a long time while the value
is fetched from memory. You could try prefetching X(1,next_j).
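Points 1 to 3 can be combined into one inner-loop sketch. Everything here is an assumption for illustration: DIM is fixed at 3, X(3,N) holds the coordinates, the neighbour indices arrive in a hypothetical list JS, and the elided "..." is replaced by a made-up update that moves I and each J apart by EPS along the line joining them. The scalars XI1..XI3 play the role of the temp array XI(*); whether the compiler then keeps them in registers is up to the compiler.

```fortran
module pair_kernel
  implicit none
contains
  ! Hypothetical stand-in for the elided "...": move particle I and each
  ! neighbour J apart, each by a distance EPS along the line joining them.
  subroutine update_pairs(x, n, i, js, nj, eps)
    integer, intent(in) :: n, i, nj
    integer, intent(in) :: js(nj)
    double precision, intent(inout) :: x(3, n)
    double precision, intent(in) :: eps
    double precision :: xi1, xi2, xi3, d1, d2, d3, dr, rinv, s
    integer :: k, j

    ! Point 2: copy X(:,I) into locals so that stores to X(:,J) below
    ! cannot be confused with X(:,I). Assumes no JS(K) equals I.
    xi1 = x(1, i)
    xi2 = x(2, i)
    xi3 = x(3, i)

    do k = 1, nj
      j = js(k)
      ! Point 3: DIM is fixed at 3, so the coordinate loop is unrolled.
      d1 = xi1 - x(1, j)
      d2 = xi2 - x(2, j)
      d3 = xi3 - x(3, j)
      dr = d1*d1 + d2*d2 + d3*d3
      ! Point 1: one reciprocal square root, then only multiplies.
      rinv = 1.0d0 / sqrt(dr)
      s = eps * rinv
      xi1 = xi1 + s*d1
      xi2 = xi2 + s*d2
      xi3 = xi3 + s*d3
      x(1, j) = x(1, j) - s*d1
      x(2, j) = x(2, j) - s*d2
      x(3, j) = x(3, j) - s*d3
    end do

    ! Point 2 again: write the accumulated values back once.
    x(1, i) = xi1
    x(2, i) = xi2
    x(3, i) = xi3
  end subroutine update_pairs
end module pair_kernel
```

For two particles at (0,0,0) and (3,4,0) with EPS = 0.5, the first moves to (-0.3,-0.4,0) and the second to (3.3,4.4,0), so their separation grows from 5 to 6.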
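As a rough worked check of the 90k figure in point 4, assume DIM = 3, REAL*8 (8-byte) coordinates, and read 90k as 90 KB (90 × 1024 bytes). The number of distinct J columns of X that fit in that budget is then:

```latex
N_J \le \frac{90 \times 1024\ \text{bytes}}{3 \times 8\ \text{bytes}}
      = \frac{92160}{24} = 3840
```

With REAL*4 coordinates the same budget covers 92160 / 12 = 7680 particles.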
I'll assume that in the inner loop, once you get one J .gt. I,
you'll probably get some more (but I don't rely on it):
      J = GRID(XCELL,YCELL,ZCELL)
    1 IF (J.GT.I) THEN
c        Safe value to use for prefetch: I
         next_j = i
         if (xcell .lt. MAXXCELL(1)) then
            next_j = GRID(XCELL+1,YCELL,ZCELL)
         endif
         if (next_j .gt. i) then
c           "prefetch" stands for your compiler's prefetch directive
            prefetch (x(1,next_j))
         endif
         ...
It would be nicer not to need the check "if (xcell .lt. MAXXCELL(1))"
on every pass.
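One way to drop that check, sketched here as a suggestion rather than the approach used above, is to allocate GRID with one extra x-cell (index MAXXCELL(1)+1) and fill it with 0. Assuming particle indices start at 1 and 0 already means "empty cell", the sentinel can never pass the "next_j .gt. i" test, so the plain comparison does all the work. The function name PREFETCH_TARGET and the argument MAXX (standing in for MAXXCELL(1)) are hypothetical.

```fortran
module padded_grid
  implicit none
contains
  ! Returns the particle whose X(1,*) is worth prefetching, or I as the
  ! safe fallback. GRID is assumed to carry an extra x-column MAXX+1
  ! filled with 0, so GRID(XCELL+1,...) is always a legal read.
  integer function prefetch_target(grid, maxx, ny, nz, xcell, ycell, zcell, i)
    integer, intent(in) :: maxx, ny, nz, xcell, ycell, zcell, i
    integer, intent(in) :: grid(maxx + 1, ny, nz)
    integer :: next_j

    ! No bounds check: when XCELL == MAXX this reads the 0 sentinel.
    next_j = grid(xcell + 1, ycell, zcell)
    if (next_j > i) then
      prefetch_target = next_j   ! caller would prefetch X(1,next_j)
    else
      prefetch_target = i        ! same safe value as the original code
    end if
  end function prefetch_target
end module padded_grid
```

The cost is one extra column of GRID in exchange for removing a compare and branch from the inner loop.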