T.R | Title | User | Personal Name | Date | Lines |
---|
1324.1 | Not measuring what you think | WIBBIN::NOYCE | Pulling weeds, pickin' stones | Mon Jun 02 1997 11:25 | 54 |
| KAP took this loop:
      DO J = 1, N
         DO I = 1, M(K)
            B(I) = ALPHA*A(I) + B(I)
         END DO
      END DO
which is intended to measure N executions of SAXPY, each of length M(K),
and turned it into (approximately) this:
      DO I = 1, M(K)
         T1 = B(I)
         T2 = ALPHA*A(I)
         DO J = 1, N
            T1 = T1 + T2
         END DO
         B(I) = T1
      END DO
This does N SAXPY's at once -- but only if they are all adding the same
multiple of the same vectors together. Surely this isn't representative
of your real application.
The KAP-transformed program performs significantly less work than the
original:
                               Original      KAP-xformed
       Fetches from memory     2*N*M(K)      2*M(K)
       Multiplies              N*M(K)        M(K)
       Adds                    N*M(K)        N*M(K)
       Stores to memory        N*M(K)        M(K)
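To make the comparison concrete, here is a sketch (mine, not from the original note; the note's code is Fortran) that tallies the counts in the table above and shows that for large N the transformed version does just over half the arithmetic:

```python
# Operation counts for the original and KAP-transformed loops,
# taken directly from the table above (n = N, m = M(K)).
def op_counts(n, m):
    original = {"fetches": 2 * n * m, "multiplies": n * m,
                "adds": n * m, "stores": n * m}
    kap = {"fetches": 2 * m, "multiplies": m,
           "adds": n * m, "stores": m}
    return original, kap

orig, kap = op_counts(n=1000, m=1000)
flops_orig = orig["multiplies"] + orig["adds"]   # 2*N*M(K) = 2,000,000
flops_kap = kap["multiplies"] + kap["adds"]      # M(K) + N*M(K) = 1,001,000
ratio = flops_kap / flops_orig                   # -> 0.5005
```

The reported "MFLOPS" assume the full 2*N*M(K) operations, so roughly half of that figure was never executed.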
So the "MFLOPS" reported by the program represent work (the multiplies)
that was never done -- only a little over half as much arithmetic was
done in the KAP-transformed version. And the "real work" of this program
is to access memory, and the KAP-transformed version does far less of
that. For example, for a vector length of 1000 the KAP-transformed program
makes a single pass over the vectors A and B, while the original program
makes 2000 passes. Similarly, for the two variants that use the PERM vector,
KAP eliminates practically all the indirect array addressing.
So, you're really measuring how fast the Alpha processor can issue floating-
point add instructions, and reporting 2x that rate as "MFLOPS". This should
come out close to 2x the clock rate of the processor, since EV5 can issue
one add and one multiply every cycle. Because Digital UNIX counts time by
recording interrupts that arrive at a rate of about 1024/sec, the timing is
quantized. For example, on my 375 MHz system, I see times of 0.005856 (6/1024,
reporting 683 MFLOPS) and 0.004880 (5/1024, reporting 819 MFLOPS) alternating
randomly. I suspect the true time is about 0.005333, corresponding to 750 MFLOPS
-- the other values represent quantization error.
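The quantization arithmetic can be checked with a small sketch (mine; the FLOP count of 4e6, i.e. 2 FLOPs times 2,000,000 iterations, is an assumption chosen to reproduce the figures above):

```python
# Reported MFLOPS when elapsed time snaps to whole 1/1024-second ticks.
TICK = 1.0 / 1024.0     # Digital UNIX timer resolution, ~0.977 ms
FLOPS = 4.0e6           # assumed total FLOPs in the timed region

def mflops(seconds):
    return FLOPS / seconds / 1.0e6

six_ticks = mflops(6 * TICK)     # -> ~683 MFLOPS
five_ticks = mflops(5 * TICK)    # -> ~819 MFLOPS
true_time = mflops(0.005333)     # -> ~750 MFLOPS
```

Only a handful of discrete MFLOPS values can ever be reported at this timer resolution, which is why the benchmark alternates between 683 and 819 rather than settling near 750.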
Finally, GEM's -O5 optimization level ought to be able to do similar kinds of
transformations. We'll look into why it doesn't.
|
1324.2 | same customer in hdlite::SPEC#192 | DECC::MDAVIS | Mark Davis - compiler maniac | Mon Jun 02 1997 12:38 | 0 |
1324.3 | Only memory traffic matters, to first approximation | WIBBIN::NOYCE | Pulling weeds, pickin' stones | Mon Jun 02 1997 15:14 | 36 |
| > And the "real work" of this program is to access memory, ...
This deserves more discussion. Supercomputer users are used to
measuring the speed of their machines in MFLOPS -- millions of
floating-point operations per second. This was appropriate in
the 1970's, when arithmetic units were expensive and memory
could keep up with them. Today memory traffic is the scarce resource
-- you can get a first-order estimate of an application's performance
on a high-speed computer by assuming all the arithmetic is "free"
and just counting the memory references. John McCalpin (from U. of
Virginia, now at SGI) developed the STREAM benchmark to measure how
quickly computers can access memory. For a discussion and results,
see http://www.cs.virginia.edu/stream/
For short vectors, EV5 systems can use on-chip caches for all
of the data accesses. But once the vectors become long enough,
all the references go off-chip. Typically, our EV5 systems can
transfer 16 bytes between the processor and the board-level cache
every 15-20 nanoseconds. Since SAXPY requires reading 8 bytes
and writing 4 bytes, the memory traffic limits it to about one
iteration every 12-15 ns. Since each iteration has two FLOPs,
that translates to peak performance of about 140-160 MFLOPS.
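A back-of-envelope sketch of that estimate (mine, using the figures in the note; the note rounds the results to "12-15 ns" and "140-160 MFLOPS"):

```python
# Bandwidth-only performance bound for single-precision SAXPY:
# each iteration reads A(I) and B(I) (8 bytes) and writes B(I) (4 bytes),
# while the EV5 board-level cache moves 16 bytes per 15-20 ns.
def saxpy_limit(bus_ns, bus_bytes=16, bytes_per_iter=12, flops_per_iter=2):
    ns_per_iter = bytes_per_iter / bus_bytes * bus_ns
    return ns_per_iter, flops_per_iter / ns_per_iter * 1000.0  # ns, MFLOPS

fast = saxpy_limit(15.0)   # ~11.3 ns/iteration, ~178 MFLOPS
slow = saxpy_limit(20.0)   # 15 ns/iteration, ~133 MFLOPS
```

Note that the arithmetic units never enter the calculation: once the vectors fall out of cache, bandwidth alone determines the result.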
Many applications can use algorithms that reuse data either in
registers or caches. SAXPY is one of the "BLAS-1" family of
routines, which don't take advantage of such reuse. If your
application can use a "BLAS-2" or "BLAS-3" routine, it can
achieve significantly higher MFLOPS levels, and come close to
saturating the arithmetic units instead of the memory units.
Even if you can't restructure your application to eliminate memory
traffic or increase the reuse of data accessed from memory, you can
sometimes tune the program to perform better. For example, in
the program referenced in .0, small changes to the value of BIG
(from 2000000 to 2000200, for example) can allow the cache to work
more effectively.
|
1324.4 | Yeah, not really measuring, but how to "translate" it? | NAMIX::jpt | FIS and Chips | Tue Jun 03 1997 09:13 | 43 |
| Bill,
>The KAP-transformed program performs significantly less work than the
>original:
This is exactly what I tried to explain to the customer, though not
in as much detail as you gave. The customer's tone ("runs like a
three-legged dog", etc.) made me careful not to seem to be "finding
excuses" by going into in-depth explanations. Based on a couple of
mails I have exchanged with the customer, I have a feeling that he
is not a very technical person when it comes to understanding
memory/cache relationships and issues. His original expectation
was "multiply your clock by two and you should get the MFLOPS
value", and he clearly doesn't understand that MFLOPS is just a
kind of "normalized" number for a certain test; in many cases
these MFLOPS values have been measured on certain supercomputers
(mostly Crays) and don't indicate the number of REAL operations
that will be executed on some other system.
Well, I wonder if your answer (.1) is one he could use to understand
the "whole picture" without thinking that DIGITAL is just trying to
explain away why we "run like a three-legged dog"?
Btw: I don't think that our numbers were that bad even with -O5,
BUT when the customer expects a 466MHz CPU to do 2*466=932 SAXPY
MFLOPS, which is a clearly unrealistic expectation, I think that we
must be careful to explain this the "right way". Maybe we should first
explain that MFLOPS != MFLOPS != MFLOPS ;-) ?
If I had time I'd use Iprof to profile the load and find hard
facts on how and why the code behaves as it does, and calculate
the number of real integer/floating-point operations executed,
pipeline utilization, etc.
>Finally, GEM's -O5 optimization level ought to be able to do similar kinds of
>transformations. We'll look into why it isn't.
Well, this is what I expected to see, and I'm glad to hear that
someone is looking at it.
Thanks,
-jari
|