
Conference turris::fortran

Title:Digital Fortran
Notice:Read notes 1.* for important information
Moderator:QUARK::LIONEL
Created:Thu Jun 01 1995
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:1333
Total number of notes:6734

1324.0. "Customer Q's F77 vs KAP" by NAMIX::jpt (FIS and Chips) Mon Jun 02 1997 04:35


	One customer asked about SAXPY performance, and my advice was
	to use KAP, which really exploits the performance of Alpha on
	SAXPY. Well, now the customer is happy with this, BUT he has
	some more or less political questions, and I'd like to hear
	the "right" answers to his concerns:

	regards,

		jari

 -------
Subject: 
       SUMMARY: how is your fp performance?
  Date: 
       Fri, 30 May 1997 19:20:59 +0300
  From: 
       vbormc::[email protected] (MAIL-11 Daemon)
    To: 
       [email protected]


Hello and thanks to all responses.

The answer lies in loop unrolling a la KAP. Nick Hill at 
Rutherford Labs has supplied me with a KAP version of the
SAXPY that produces a fabulous 960MFLOPS. I have left this
code for you all to try at...
http://www.gre.ac.uk/~k.mcmanus/saxpy.kap.f
and the original
http://www.gre.ac.uk/~k.mcmanus/saxpy.f
is still there for comparison.
A point of interest is that the KAP version doubles the FLOP
rate on Sun machines.

This raises some intriguing questions.....

1       How come the FLOP rate is more than twice the clock rate
        of 466MHz??

2       Why can the compiler not manage this elementary transform??

3       Is this a conspiracy to raise royalty for Kuck??

4       Has anybody compared this rather poor compiler performance
        against the impressive SGI v6 compiler??

5       Why did DEC not tell me before buying Half a million
        bucks of kit that without KAP it would run like a three
        legged dog??

Answers on an email please to

[email protected]  -  http://www.gre.ac.uk/~k.mcmanus
-------------------------------------------------------------
Dr Kevin McManus                     ||
School of Computing & Math Science   ||
The University of Greenwich          ||
Wellington St.  Woolwich             ||Tel +44 (0)181 331 8719 
London SE18 6PF  UK                  ||Fax +44 (0)181 331 8665

1324.1. "Not measuring what you think" by WIBBIN::NOYCE (Pulling weeds, pickin' stones) Mon Jun 02 1997 11:25
KAP took this loop:

	DO J = 1, N
	  DO I = 1, M(K)
	    B(I) = ALPHA*A(I) + B(I)
	  END DO
	END DO

which is intended to measure N executions of SAXPY, each of length M(K),
and turned it into (approximately) this:

	DO I=1,M(K)
	  T1 = B(I)
	  T2 = ALPHA*A(I)
	  DO J = 1, N
	    T1 = T1 + T2
	  ENDDO
	  B(I) = T1
	ENDDO

This does N SAXPY's at once -- but only if they are all adding the same
multiple of the same vectors together.  Surely this isn't representative
of your real application.

The KAP-transformed program performs significantly less work than the
original:
			Original	KAP-xformed
Fetches from memory	2*N*M(K)	  2*M(K)
Multiplies		  N*M(K)	    M(K)
Adds			  N*M(K)	  N*M(K)
Stores to memory	  N*M(K)	    M(K)

So the "MFLOPS" reported by the program represent work (the multiplies)
that was never done -- only a little over half as much arithmetic was
done in the KAP-transformed version.  And the "real work" of this program
is to access memory, and the KAP-transformed version does far less of
that.  For example, for a vector length of 1000 the KAP-transformed program
makes a single pass over the vectors A and B, while the original program
makes 2000 passes.  Similarly, for the two variants that use the PERM vector,
KAP eliminates practically all the indirect array addressing.
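
The work counts in the table can be spelled out directly (a sketch in
Python for illustration; the counts just follow the two loop nests shown
above):

```python
def original_counts(n, m):
    """Work done by the original loop nest: N passes over vectors of length M."""
    return {
        "fetches": 2 * n * m,   # A(I) and B(I) loaded every inner iteration
        "multiplies": n * m,    # ALPHA*A(I) every inner iteration
        "adds": n * m,
        "stores": n * m,        # B(I) written every inner iteration
    }

def kap_counts(n, m):
    """Work after the KAP loop interchange: a single pass over the vectors."""
    return {
        "fetches": 2 * m,       # A(I) and B(I) loaded once each
        "multiplies": m,        # ALPHA*A(I) hoisted out of the J loop
        "adds": n * m,          # only the add is still done N times
        "stores": m,            # B(I) written once
    }
```

For N = 2000 and M = 1000 the transformed version keeps all the adds but
does 1/2000 of the multiplies, fetches, and stores -- yet the benchmark
still reports MFLOPS as if 2*N*M flops had been done.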

So, you're really measuring how fast the Alpha processor can issue floating-
point add instructions, and reporting 2x that rate as "MFLOPS".  This should
come out close to 2x the clock rate of the processor, since EV5 can issue
one add and one multiply every cycle.  Because Digital UNIX counts time by
recording interrupts that arrive at a rate of about 1024/sec, the timing is
quantized.  For example, on my 375 MHz system, I see times of 0.005856 (6/1024,
reporting 683 MFLOPS) and 0.004880 (5/1024, reporting 819 MFLOPS) alternating
randomly.  I suspect the true time is about 0.005333, corresponding to 750 MFLOPS
-- the other values represent quantization error.
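
The quantization arithmetic can be sketched as follows (Python for
illustration; the ~4 million flops per run is the figure implied by the
reported times and MFLOPS values above):

```python
import math

TICK = 1.0 / 1024.0  # Digital UNIX accounts time in ~1024/sec interrupt ticks

def quantized_mflops(true_time_s, flops):
    """MFLOPS values reported when the measured time snaps to a whole
    number of clock ticks on either side of the true elapsed time."""
    lo_ticks = math.floor(true_time_s / TICK)   # run charged one tick short
    hi_ticks = lo_ticks + 1                     # run charged one tick long
    return (flops / (hi_ticks * TICK) / 1e6,    # lower reported rate
            flops / (lo_ticks * TICK) / 1e6)    # higher reported rate
```

With a true time around 0.00533 seconds and about 4e6 flops per run this
gives roughly 683 and 819 MFLOPS -- the two alternating readings above --
while the true rate is about 750 MFLOPS.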

Finally, GEM's -O5 optimization level ought to be able to do similar kinds of
transformations.  We'll look into why it isn't.

1324.2. "same customer in hdlite::SPEC#192" by DECC::MDAVIS (Mark Davis - compiler maniac) Mon Jun 02 1997 12:38
1324.3. "Only memory traffic matters, to first approximation" by WIBBIN::NOYCE (Pulling weeds, pickin' stones) Mon Jun 02 1997 15:14
> And the "real work" of this program is to access memory, ...

This deserves more discussion.  Supercomputer users are used to
measuring the speed of their machines in MFLOPS -- millions of
floating-point operations per second.  This was appropriate in
the 1970's, when arithmetic units were expensive, and memory
could keep up with them.  Today memory traffic is the scarce resource
-- you can get a first-order estimate of an application's performance
on a high-speed computer by assuming all the arithmetic is "free"
and just counting the memory references.  John McCalpin (from U. of
Virginia, now at SGI) developed the STREAM benchmark to measure how
quickly computers can access memory.  For a discussion and results,
see  http://www.cs.virginia.edu/stream/

For short vectors, EV5 systems can use on-chip caches for all
of the data accesses.  But once the vectors become long enough,
all the references go off-chip.  Typically, our EV5 systems can
transfer 16 bytes between the processor and the board-level cache
every 15-20 nanoseconds.  Since SAXPY requires reading 8 bytes
and writing 4 bytes, the memory traffic limits it to about one
iteration every 12-15 ns.  Since each iteration has two FLOPs,
that translates to peak performance of about 140-160 MFLOPS.
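
That estimate can be written out as a small model (a sketch; the 16
bytes per 15-20 ns figure is the board-cache transfer rate quoted
above):

```python
def saxpy_mflops_ceiling(ns_per_transfer, bytes_per_transfer=16):
    """First-order MFLOPS ceiling for SAXPY when memory traffic is the
    bottleneck: each iteration moves 12 bytes (read A(I) and B(I),
    write B(I), REAL*4 each) and performs 2 flops."""
    bytes_per_iter = 12.0
    flops_per_iter = 2.0
    ns_per_iter = ns_per_transfer * bytes_per_iter / bytes_per_transfer
    return flops_per_iter / (ns_per_iter * 1e-9) / 1e6
```

At 15-20 ns per 16-byte transfer this gives a ceiling of roughly
133-178 MFLOPS, in line with the 140-160 figure above once rounded.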

Many applications can use algorithms that reuse data either in
registers or caches.  SAXPY is one of the "BLAS-1" family of
routines, which don't take advantage of such reuse.  If your
application can use a "BLAS-2" or "BLAS-3" routine, it can
achieve significantly higher MFLOPS levels, and come close to
saturating the arithmetic units instead of the memory units.
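
The reuse argument can be made concrete with flop-to-byte ratios (a
sketch using idealized counts; the matrix-multiply traffic assumes
perfect blocking):

```python
def saxpy_flops_per_byte(n):
    """BLAS-1: 2n flops against 12n bytes of memory traffic -- the
    ratio is a constant 1/6, so memory always limits performance."""
    return (2.0 * n) / (12.0 * n)

def matmul_flops_per_byte(n):
    """BLAS-3: an n x n matrix multiply does 2*n**3 flops; with ideal
    blocking each of the 3*n**2 REAL*4 elements is moved only once,
    so traffic is 12*n**2 bytes and the ratio grows as n/6."""
    return (2.0 * n**3) / (12.0 * n**2)
```

For n = 600 the matmul ratio is already 100 flops per byte of traffic
versus SAXPY's fixed 1/6, which is why BLAS-3 codes can keep the
arithmetic units busy instead of the memory system.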

Even if you can't restructure your application to eliminate memory
traffic or increase the reuse of data accessed from memory, you can
sometimes tune the program to perform better.  For example, in
the program referenced in .0, small changes to the value of BIG
(from 2000000 to 2000200, for example) can allow the cache to work
more effectively.
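
One plausible mechanism (a sketch, assuming a hypothetical 2 MB
direct-mapped cache; the actual cache geometry of the machine in .0
isn't given here) is that padding changes whether corresponding
elements of the two arrays map to the same cache line:

```python
def arrays_conflict(n_elements, elem_bytes=4, cache_bytes=2**21):
    """For two arrays laid out back to back in memory, corresponding
    elements A(I) and B(I) sit n_elements*elem_bytes apart.  In a
    direct-mapped cache they evict each other on every access exactly
    when that distance is a multiple of the cache size."""
    return (n_elements * elem_bytes) % cache_bytes == 0
```

With this hypothetical cache, arrays of 2**19 REAL*4 elements conflict
on every iteration, while adding a couple of hundred elements of
padding shifts the mapping and removes the conflict -- the same kind of
effect a small change to BIG can have.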
1324.4. "Yeah, not really measuring, but how to "translate" it?" by NAMIX::jpt (FIS and Chips) Tue Jun 03 1997 09:13
	Bill,

>The KAP-transformed program performs significantly less work than the
>original:

	This is exactly what I tried to explain to the customer, though
	not with as detailed an explanation as you gave. The customer's
	tone ("runs like three legged dog") etc. made me careful not to
	appear to be "finding excuses" by going into in-depth
	explanations. Based on the couple of mails I have exchanged with
	the customer, I have a feeling that he is not a very technical
	person when it comes to understanding memory/cache relationships
	and issues. His original expectation is "multiply your clock by
	two and you should get the MFLOPS value", and he clearly doesn't
	understand that MFLOPS is just a kind of "normalized" number for
	a certain test; in many cases these MFLOPS values have been
	measured on certain supercomputers (mostly Cray) and don't
	indicate the number of REAL operations that will be executed on
	some other system.

	Well, I wonder if your answer (.1) is such that he could understand
	the "whole picture" without thinking that DIGITAL is just trying to
	explain away why we "run like three legged dog..."?

	Btw: I don't think that our numbers were that bad even with -O5,
	BUT when the customer expects a 466MHz CPU to do 2*466=932 SAXPY
	MFLOPS, which is a clearly unrealistic expectation, I think that
	we must be careful to explain this the "right way". Maybe we
	should first explain that MFLOPS != MFLOPS != MFLOPS ;-) ?

	If I had time I'd use Iprof to profile the load, find hard facts
	on how and why the code behaves as it does, and calculate the
	numbers of real integer/floating operations executed, pipeline
	utilization, etc.

>Finally, GEM's -O5 optimization level ought to be able to do similar kinds of
>transformations.  We'll look into why it isn't.

	Well, this was what I expected to see, and I'm glad to hear that
	someone is looking into it.

	Thanks,

		-jari