T.R | Title | User | Personal Name | Date | Lines |
--------------------------------------------------------------------------------
673.1 | L1 & L2 don't depend on the platform | WIBBIN::NOYCE | Pulling weeds, pickin' stones | Thu Apr 03 1997 09:18 | 4 |
| For the on-chip caches, see 528.14, though you'll have to
recalculate based on the clock speeds you care about.
<A previous version of this reply had the wrong note number>
|
673.2 | Pointers to existing stuff... | PERFOM::HENNING | | Thu Apr 03 1997 12:50 | 10 |
| Re: main memory bandwidth - I don't believe 440 changed from 350, nor
466 from 400 - though if minor differences of a percent or two are
crucial to you, let me know and I'll try to remeasure.
It's been a few months since I updated my bandwidth web page, but if
you've not seen it, check out
http://tlg-www.zko.dec.com/~henning/Mem_bw.html and also the less
pleasant http://tlg-www.zko.dec.com/~henning/Mem_bw_internal.html
Both the above need to be updated for the Miata good news.
|
673.3 | How difficult to hack McCalpin to measure? | BBPBV1::WALLACE | john wallace @ bbp. +44 860 675093 | Fri Apr 04 1997 04:09 | 8 |
| Could we *measure* these figures given McCalpin source with suitably
modified (i.e. small) arrays (and suitable hardware to run the test
on)? Or does it not work like that?
Would that help?
regards
john
|
673.4 | | TKOV50::NAKANO | | Fri Apr 04 1997 10:56 | 7 |
| The memory bandwidth of Miata without a cache is better than with the 2mb
cache. We have found this on several applications. Also, the Miata 433a is
better than the AlphaStation 500/400. Could you tell me the reason?
Regards
mamoru
|
673.5 | cache actually hurts when it's too small | WRKSYS::SCHUMANN | | Fri Apr 04 1997 12:32 | 7 |
| An application that does not fit in the cache frequently runs faster on a
machine without a cache. Each access that misses in the cache must first probe
the cache to discover the miss; these "wasted" cache accesses increase the
effective memory access time and consume 20-30% of the bus bandwidth.
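To put rough numbers on it (these are made up purely for illustration): suppose
a B-cache probe costs 10 cycles and a main-memory access 60 cycles. An
application that misses 90% of the time averages 0.1*10 + 0.9*(10+60) = 64
cycles per access through the cache, versus a flat 60 cycles going straight to
memory - a net loss, even before counting the bus cycles the probes consume.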
--RS
|
673.6 | | DECCXL::OUELLETTE | crunch | Fri Apr 04 1997 13:59 | 10 |
| > (i.e. small) arrays
The benchmark measures memory bandwidth.
You must choose array sizes larger than the largest cache.
That's the deal... otherwise you haven't run McCalpin's benchmark.
Prof. McCalpin's field of study at the U. of Delaware (I think) was reservoir
simulation and weather modeling; he's at SGI now.
None of his data stays in cache long ... that's why he (and people
like him) care ... that's why he wrote the benchmark.
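In case it helps to visualize what is being timed, the heart of the benchmark
is just four loops over big double-precision arrays. A rough C rendition (the
real thing is Fortran, with timing and reporting wrapped around each loop; the
names in quotes match the labels in the output):

  /* The four kernels, roughly as McCalpin defines them.  N must be chosen so
   * that 3 * N * sizeof(double) is well beyond the largest cache, otherwise
   * you are measuring cache bandwidth rather than memory bandwidth. */
  #define N 2000000                 /* three 2M-element double arrays, ~48 mb */
  static double a[N], b[N], c[N];

  void stream_kernels(double scalar)
  {
      long i;
      for (i = 0; i < N; i++) c[i] = a[i];                  /* "Assignment" */
      for (i = 0; i < N; i++) b[i] = scalar * c[i];         /* "Scaling"    */
      for (i = 0; i < N; i++) c[i] = a[i] + b[i];           /* "Summing"    */
      for (i = 0; i < N; i++) a[i] = b[i] + scalar * c[i];  /* "SAXPYing"   */
  }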
|
673.8 | On-chip measurements; b-cache vs. no b-cache | PERFOM::HENNING | | Sun Apr 06 1997 22:47 | 106 |
| [repost with corrected total size calculations]
Yes one can measure bandwidth for on-chip caches and between b-cache and
CPU with a suitably modified McCalpin Streams bmark - you use rpcc and
small array sizes. But trying to get close to the theoretical peak values
is tricky - Bob Nix wrote a memory test a while back in which he made the
comment:
*
* Code quality: These tests are sensitive to the quality of code
* generated by the compiler. The test results also include the loop control
* overhead. This overhead can be factored out of the latency tests by
* simply subtracting off a known latency, the L1 latency on a cache-line
* sized stride, from all results. The bandwidth results can't be adjusted
 * in such a simple way -- these tests simply require a good optimizing compiler.
 * I've successfully compiled the bandwidth tests to run at ~80% of hardware peak on
* Alpha OSF, Alpha NT, HP Snake Unix, IBM Power1 Unix, and Pentium NT; but
* getting that performance required fiddling with source, switches, and careful
* checking against expected results.
*/
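For anyone who wants to try this themselves, here is a minimal sketch of such a
small-array timing loop in C. It is illustrative only: the actual runs were
Fortran, it assumes GCC-style inline asm for rpcc (with DEC C one would
presumably use its asm() intrinsic instead), it hard-codes the clock speed, and
it times just the SAXPY-style kernel.

  #include <stdio.h>

  #define N      250      /* elements per array: 250 * 8 bytes ~ 2 kb each  */
  #define NTIMES 1000     /* repeat so the cycle-count delta is respectable */

  static double a[N], b[N], c[N];

  static unsigned long rpcc(void)          /* read the Alpha cycle counter  */
  {
      unsigned long cc;
      __asm__ __volatile__("rpcc %0" : "=r" (cc));
      return cc & 0xffffffffUL;            /* low 32 bits are the counter   */
  }

  int main(void)
  {
      double scalar = 3.0, mhz = 300.0;    /* clock of the machine under test */
      unsigned long t0, t1, cycles;
      long i, k;

      for (i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

      t0 = rpcc();
      for (k = 0; k < NTIMES; k++)         /* note: an aggressive optimizer */
          for (i = 0; i < N; i++)          /* may collapse this repeat loop */
              c[i] = a[i] + scalar * b[i];
      t1 = rpcc();
      cycles = (t1 - t0) & 0xffffffffUL;   /* counter wraps at 2^32 */

      /* three arrays touched, 8 bytes each, N elements, NTIMES repeats */
      printf("SAXPYing: %.1f MB/s\n",
             24.0 * N * NTIMES / (cycles / (mhz * 1e6)) / 1e6);
      return 0;
  }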
Running an rpcc'd McCalpin bmark with array sizes of 250 (~2k for each of
the three arrays, so all three should fit in the Dcache) just now with
Digital Fortran 77 V4.1-92-33BL on an EV5 @ 300 MHz using
f77 -O5 -tune ev4 gave:
Function Rate (MB/s)
Assignment: 2096.3372
Scaling : 1564.0641
Summing : 2080.6025
SAXPYing : 1848.0018
f77 -O5 -tune ev5 gave:
Assignment: 2296.7518
Scaling : 1871.0299
Summing : 2509.4733
SAXPYing : 2438.1621
and f77 -tune ev5 (i.e. dropping the -O5) gave:
Assignment: 2204.0389
Scaling : 1562.0302
Summing : 2052.1661
SAXPYing : 1682.3546
Taking the best of the above (-O5 -tune ev5) and varying the array size
to 1000 (~8kb each array, ~24kb total, fits in one bank of S-cache) gave:
Assignment: 1714.1651
Scaling : 1749.7468
Summing : 1393.5049
SAXPYing : 1574.6575
Changing the array size to 4000 (which means ~32kb for each of the three
arrays, ~96kb total) gives:
Assignment: 1364.9044
Scaling : 825.7452
Summing : 972.5669
SAXPYing : 1058.3659
which is quite a drop - one suspects that not all 96kb was stored in the
s-cache. Switching to 20,000 elements (~1/2 mb total) drops down to:
Assignment: 410.1500
Scaling : 401.9950
Summing : 418.6805
SAXPYing : 436.3363
Finally, changing the array size to 1,000,000 (~24mb total) drops the
bandwidth down to main memory speed:
Assignment: 98.3563
Scaling : 105.1684
Summing : 108.7296
SAXPYing : 118.0426
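(All the footprints above follow from 3 arrays x N elements x 8 bytes/element:
N=250 -> ~6 kb total, 1,000 -> ~24 kb, 4,000 -> ~96 kb, 20,000 -> ~480 kb,
and 1,000,000 -> ~24 mb.)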
CAUTION #1: The on-chip measurements above are for DIGITAL INTERNAL USE
ONLY. The on-chip figures listed in 528.14 are the right ones to quote
externally, not these, because of their extreme sensitivity to code quality
and the fact that there is no way to ensure a "level playing field" between
one vendor's on-chip figures and another's.
CAUTION #2: You also should not quote the final figures above, the
bandwidth to main memory, because my available EV5 happened to be a
system that does not have stellar main memory bandwidth. If the
customer cares about main memory bandwidth, point out Turbolaser,
Rawhide, or Miata. See published results at
http://www.cs.virginia.edu/stream/standard/Bandwidth.html
Machine ID ncpus COPY SCALE ADD TRIAD
DEC_8400_5-350 1 215.7 207.5 219.6 234.2
DEC_4100_5-400- 1 247.8 243.9 264.9 268.0
DEC_433a-(0MB_L3) 1 292.6 292.6 323.4 341.3
As to B-cache or no B-cache, right, sometimes the B-cache will actually slow
you down. A little note on that subject is at
http://tlg-www.zko.dec.com/~henning/bcache.html
/John Henning
CSD Performance Group
|
673.9 | ..and. | RDGENG::WILLIAMS_A | | Mon Apr 07 1997 10:21 | 3 |
| and the 'on-chip' stuff gets 'wider' with a faster clock, right?
AW
|
673.10 | OK, a more contemporary chip | I4GET::HENNING | | Mon Apr 07 1997 13:11 | 60 |
| Per Adrian's request, a follow-on to .8 with a faster system.
The following data is for Digital Internal Use Only. It uses an Alpha
21164 near or at the MHz limits recently announced (see
http://www.digital.com/PR00SK) which has been incorporated into an as-yet
unannounced system. Actual mileage may vary. The precise system that I
am using may never be announced; product definition is subject to change.
Insert additional qualifiers here.
Anyway, with array size 250, Fortran 77 V4.1-92-33BL, and f77 -O5 -tune ev5,
yes, the achieved on-chip bandwidth to the Dcache goes up substantially:
Function Rate (MB/s)
Assignment: 4579.7506
Scaling : 3736.6764
Summing : 5003.9197
SAXPYing : 4848.6024
And to the S-cache (Array size 1000)
Assignment: 3419.2856
Scaling : 3491.5592
Summing : 2779.2028
SAXPYing : 3140.5720
Array size 4000
Assignment: 2722.6025
Scaling : 1814.7246
Summing : 2047.1877
SAXPYing : 2201.5637
Here's an array size big enough to hit the board cache (20000) - note that
the system under test here has more than double the bandwidth to the cache of
the system in .8:
Assignment: 928.6953
Scaling : 925.8906
Summing : 975.0332
SAXPYing : 1016.0961
And in fact this bandwidth to the (8mb) cache holds up fairly well even as
the array sizes are increased to 300,000 elements (~2.4mb each, ~7.2mb total):
Assignment: 890.7892
Scaling : 740.7456
Summing : 998.8359
SAXPYing : 918.9741
Finally, here's the main memory speed, with an array size of 3,000,000
elements (~72mb total):
Assignment: 266.0516
Scaling : 251.8598
Summing : 271.3901
SAXPYing : 277.2678
CAUTION: these numbers are for Digital Internal Use Only, as explained
in reply .8.
|