
Conference turris::fortran

Title:Digital Fortran
Notice:Read notes 1.* for important information
Moderator:QUARK::LIONEL
Created:Thu Jun 01 1995
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:1333
Total number of notes:6734

1324.0. "Customer Q's F77 vs KAP" by NAMIX::jpt (FIS and Chips) Mon Jun 02 1997 04:35


	One customer asked about SAXPY performance, and my advice was
	to use KAP, which really exploits the performance of Alpha on
	SAXPY. Well, now the customer is happy with this, BUT he has
	some more or less political questions, and I'd like to hear
	the "right" answers to his concerns:

	regards,

		jari

 -------
Subject: 
       SUMMARY: how is your fp performance?
  Date: 
       Fri, 30 May 1997 19:20:59 +0300
  From: 
       vbormc::[email protected] (MAIL-11 Daemon)
    To: 
       [email protected]


Hello and thanks to all responses.

The answer lies in loop unrolling a la KAP. Nick Hill at 
Rutherford Labs has supplied me with a KAP version of the
SAXPY that produces a fabulous 960MFLOPS. I have left this
code for you all to try at...
http://www.gre.ac.uk/~k.mcmanus/saxpy.kap.f
and the original
http://www.gre.ac.uk/~k.mcmanus/saxpy.f
is still there for comparison.
A point of interest is that the KAP version doubles the FLOP
rate on Sun machines.

This raises some intriguing questions.....

1       How come the FLOP rate is more than twice the clock rate
        of 466MHz??

2       Why can the compiler not manage this elementary transform??

3       Is this a conspiracy to raise royalty for Kuck??

4       Has anybody compared this rather poor compiler performance
        against the impressive SGI v6 compiler??

5       Why did DEC not tell me before buying Half a million
        bucks of kit that without KAP it would run like a three
        legged dog??

Answers on an email please to

[email protected]  -  http://www.gre.ac.uk/~k.mcmanus
-------------------------------------------------------------
Dr Kevin McManus                     ||
School of Computing & Math Science   ||
The University of Greenwich          ||
Wellington St.  Woolwich             ||Tel +44 (0)181 331 8719 
London SE18 6PF  UK                  ||Fax +44 (0)181 331 8665

1324.1. "Not measuring what you think" by WIBBIN::NOYCE (Pulling weeds, pickin' stones) Mon Jun 02 1997 11:25
KAP took this loop:

	DO J = 1, N
	  DO I = 1, M(K)
	    B(I) = ALPHA*A(I) + B(I)
	  END DO
	END DO

which is intended to measure N executions of SAXPY, each of length M(K),
and turned it into (approximately) this:

	DO I=1,M(K)
	  T1 = B(I)
	  T2 = ALPHA*A(I)
	  DO J = 1, N
	    T1 = T1 + T2
	  ENDDO
	  B(I) = T1
	ENDDO

This does N SAXPY's at once -- but only if they are all adding the same
multiple of the same vectors together.  Surely this isn't representative
of your real application.

The KAP-transformed program performs significantly less work than the
original:
			Original	KAP-xformed
Fetches from memory	2*N*M(K)	  2*M(K)
Multiplies		  N*M(K)	    M(K)
Adds			  N*M(K)	  N*M(K)
Stores to memory	  N*M(K)	    M(K)

So the "MFLOPS" reported by the program represent work (the multiplies)
that was never done -- only a little over half as much arithmetic was
done in the KAP-transformed version.  And the "real work" of this program
is to access memory, and the KAP-transformed version does far less of
that.  For example, for a vector length of 1000 the KAP-transformed program
makes a single pass over the vectors A and B, while the original program
makes 2000 passes.  Similarly, for the two variants that use the PERM vector,
KAP eliminates practically all the indirect array addressing.
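
The work counts in the table can be spelled out directly (a sketch in
Python for illustration; the counts just follow the two loop nests shown
above):

```python
def original_counts(n, m):
    """Work done by the original loop nest: N passes over vectors of length M."""
    return {
        "fetches": 2 * n * m,   # A(I) and B(I) loaded every inner iteration
        "multiplies": n * m,    # ALPHA*A(I) every inner iteration
        "adds": n * m,
        "stores": n * m,        # B(I) written every inner iteration
    }

def kap_counts(n, m):
    """Work after the KAP loop interchange: a single pass over the vectors."""
    return {
        "fetches": 2 * m,       # A(I) and B(I) loaded once each
        "multiplies": m,        # ALPHA*A(I) hoisted out of the J loop
        "adds": n * m,          # only the add is still done N times
        "stores": m,            # B(I) written once
    }
```

For N = 2000 and M = 1000 the transformed version keeps all the adds but
does 1/2000 of the multiplies, fetches, and stores -- yet the benchmark
still reports MFLOPS as if 2*N*M flops had been done.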

So, you're really measuring how fast the Alpha processor can issue floating-
point add instructions, and reporting 2x that rate as "MFLOPS".  This should
come out close to 2x the clock rate of the processor, since EV5 can issue
one add and one multiply every cycle.  Because Digital UNIX counts time by
recording interrupts that arrive at a rate of about 1024/sec, the timing is
quantized.  For example, on my 375 MHz system, I see times of 0.005856 (6/1024,
reporting 683 MFLOPS) and 0.004880 (5/1024, reporting 819 MFLOPS) alternating
randomly.  I suspect the true time is about 0.005333, corresponding to 750 MFLOPS
-- the other values represent quantization error.
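
The quantization arithmetic can be sketched as follows (Python for
illustration; the ~4 million flops per run is the figure implied by the
reported times and MFLOPS values above):

```python
import math

TICK = 1.0 / 1024.0  # Digital UNIX accounts time in ~1024/sec interrupt ticks

def quantized_mflops(true_time_s, flops):
    """MFLOPS values reported when the measured time snaps to a whole
    number of clock ticks on either side of the true elapsed time."""
    lo_ticks = math.floor(true_time_s / TICK)   # run charged one tick short
    hi_ticks = lo_ticks + 1                     # run charged one tick long
    return (flops / (hi_ticks * TICK) / 1e6,    # lower reported rate
            flops / (lo_ticks * TICK) / 1e6)    # higher reported rate
```

With a true time around 0.00533 seconds and about 4e6 flops per run this
gives roughly 683 and 819 MFLOPS -- the two alternating readings above --
while the true rate is about 750 MFLOPS.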

Finally, GEM's -O5 optimization level ought to be able to do similar kinds of
transformations.  We'll look into why it isn't.

1324.2. "same customer in hdlite::SPEC#192" by DECC::MDAVIS (Mark Davis - compiler maniac) Mon Jun 02 1997 12:38
1324.3. "Only memory traffic matters, to first approximation" by WIBBIN::NOYCE (Pulling weeds, pickin' stones) Mon Jun 02 1997 15:14
> And the "real work" of this program is to access memory, ...

This deserves more discussion.  Supercomputer users are used to
measuring the speed of their machines in MFLOPS -- millions of
floating-point operations per second.  This was appropriate in
the 1970's, when arithmetic units were expensive, and memory
could keep up with them.  Today memory traffic is the scarce resource
-- you can get a first-order estimate of an application's performance
on a high-speed computer by assuming all the arithmetic is "free"
and just counting the memory references.  John McCalpin (from U. of
Virginia, now at SGI) developed the STREAM benchmark to measure how
quickly computers can access memory.  For a discussion and results,
see  http://www.cs.virginia.edu/stream/

For short vectors, EV5 systems can use on-chip caches for all
of the data accesses.  But once the vectors become long enough,
all the references go off-chip.  Typically, our EV5 systems can
transfer 16 bytes between the processor and the board-level cache
every 15-20 nanoseconds.  Since SAXPY requires reading 8 bytes
and writing 4 bytes, the memory traffic limits it to about one
iteration every 12-15 ns.  Since each iteration has two FLOPs,
that translates to peak performance of about 140-160 MFLOPS.
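
That estimate can be written out as a small model (a sketch; the 16
bytes per 15-20 ns figure is the board-cache transfer rate quoted
above):

```python
def saxpy_mflops_ceiling(ns_per_transfer, bytes_per_transfer=16):
    """First-order MFLOPS ceiling for SAXPY when memory traffic is the
    bottleneck: each iteration moves 12 bytes (read A(I) and B(I),
    write B(I), REAL*4 each) and performs 2 flops."""
    bytes_per_iter = 12.0
    flops_per_iter = 2.0
    ns_per_iter = ns_per_transfer * bytes_per_iter / bytes_per_transfer
    return flops_per_iter / (ns_per_iter * 1e-9) / 1e6
```

At 15-20 ns per 16-byte transfer this gives a ceiling of roughly
133-178 MFLOPS, in line with the 140-160 figure above once rounded.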

Many applications can use algorithms that reuse data either in
registers or caches.  SAXPY is one of the "BLAS-1" family of
routines, which don't take advantage of such reuse.  If your
application can use a "BLAS-2" or "BLAS-3" routine, it can
achieve significantly higher MFLOPS levels, and come close to
saturating the arithmetic units instead of the memory units.
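
The reuse argument can be made concrete with flop-to-byte ratios (a
sketch using idealized counts; the matrix-multiply traffic assumes
perfect blocking):

```python
def saxpy_flops_per_byte(n):
    """BLAS-1: 2n flops against 12n bytes of memory traffic -- the
    ratio is a constant 1/6, so memory always limits performance."""
    return (2.0 * n) / (12.0 * n)

def matmul_flops_per_byte(n):
    """BLAS-3: an n x n matrix multiply does 2*n**3 flops; with ideal
    blocking each of the 3*n**2 REAL*4 elements is moved only once,
    so traffic is 12*n**2 bytes and the ratio grows as n/6."""
    return (2.0 * n**3) / (12.0 * n**2)
```

For n = 600 the matmul ratio is already 100 flops per byte of traffic
versus SAXPY's fixed 1/6, which is why BLAS-3 codes can keep the
arithmetic units busy instead of the memory system.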

Even if you can't restructure your application to eliminate memory
traffic or increase the reuse of data accessed from memory, you can
sometimes tune the program to perform better.  For example, in
the program referenced in .0, small changes to the value of BIG
(from 2000000 to 2000200, for example) can allow the cache to work
more effectively.
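
One plausible mechanism (a sketch, assuming a hypothetical 2 MB
direct-mapped cache; the actual cache geometry of the machine in .0
isn't given here) is that padding changes whether corresponding
elements of the two arrays map to the same cache line:

```python
def arrays_conflict(n_elements, elem_bytes=4, cache_bytes=2**21):
    """For two arrays laid out back to back in memory, corresponding
    elements A(I) and B(I) sit n_elements*elem_bytes apart.  In a
    direct-mapped cache they evict each other on every access exactly
    when that distance is a multiple of the cache size."""
    return (n_elements * elem_bytes) % cache_bytes == 0
```

With this hypothetical cache, arrays of 2**19 REAL*4 elements conflict
on every iteration, while adding a couple of hundred elements of
padding shifts the mapping and removes the conflict -- the same kind of
effect a small change to BIG can have.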
1324.4. "Yeah, not really measuring, but how to "translate" it?" by NAMIX::jpt (FIS and Chips) Tue Jun 03 1997 09:13
	Bill,

>The KAP-transformed program performs significantly less work than the
>original:

	This is exactly what I tried to explain to the customer, though
	not with as detailed an explanation as you gave. The customer's
	tone ("runs like three legged dog") etc. made me careful not to
	appear to be "finding excuses" by going into in-depth
	explanations. Based on the couple of mails I have exchanged with
	the customer, I have a feeling that he is not a very technical
	person when it comes to understanding memory/cache relationships
	and issues. His original expectation is "multiply your clock by
	two and you should get the MFLOPS value", and he clearly doesn't
	understand that MFLOPS is just a kind of "normalized" number for
	a certain test; in many cases these MFLOPS values have been
	measured on certain supercomputers (mostly Cray) and don't
	indicate the number of REAL operations that will be executed on
	some other system.

	Well, I wonder if your answer (.1) is such that he could understand
	the "whole picture" without thinking that DIGITAL is just trying to
	explain away why we "run like three legged dog..."?

	Btw: I don't think that our numbers were that bad even with -O5,
	BUT when the customer expects a 466MHz CPU to do 2*466=932 SAXPY
	MFLOPS, which is a clearly unrealistic expectation, I think that
	we must be careful to explain this the "right way". Maybe we
	should first explain that MFLOPS != MFLOPS != MFLOPS ;-) ?

	If I had time I'd use Iprof to profile the load, find hard facts
	on how and why the code behaves as it does, and calculate the
	numbers of real integer/floating operations executed, pipeline
	utilization, etc.

>Finally, GEM's -O5 optimization level ought to be able to do similar kinds of
>transformations.  We'll look into why it isn't.

	Well, this was what I expected to see, and I'm glad to hear that
	someone is looking into it.

	Thanks,

		-jari