T.R | Title | User | Personal Name | Date | Lines |
--------------------------------------------------------------------------------
673.1 | L1 & L2 don't depend on the platform | WIBBIN::NOYCE | Pulling weeds, pickin' stones | Thu Apr 03 1997 09:18 | 4 |
| For the on-chip caches, see 528.14, though you'll have to
recalculate based on the clock speeds you care about.
<A previous version of this reply had the wrong note number>
|
673.2 | Pointers to existing stuff... | PERFOM::HENNING | | Thu Apr 03 1997 12:50 | 10 |
| Re: main memory bandwidth - I don't believe 440 changed from 350, nor
466 from 400 - though if minor differences of a percent or two are
crucial to you, let me know and I'll try to remeasure.
It's been a few months since I updated my bandwidth web page, but if
you've not seen it, check out
http://tlg-www.zko.dec.com/~henning/Mem_bw.html and also the less
pleasant http://tlg-www.zko.dec.com/~henning/Mem_bw_internal.html
Both the above need to be updated for the Miata good news.
|
673.3 | How difficult to hack McCalpin to measure? | BBPBV1::WALLACE | john wallace @ bbp. +44 860 675093 | Fri Apr 04 1997 04:09 | 8 |
| Could we *measure* these figures given McCalpin source with suitably
modified (i.e. small) arrays (and suitable hardware to run the test
on)? Or does it not work like that?
Would that help?
regards
john
|
673.4 | | TKOV50::NAKANO | | Fri Apr 04 1997 10:56 | 7 |
| The memory bandwidth of Miata without a cache is better than with the 2mb
cache. We have found this on several applications. Also, the Miata 433a is
better than the AlphaStation 500/400. Could you tell me the reason?
Regards
mamoru
|
673.5 | cache actually hurts when it's too small | WRKSYS::SCHUMANN | | Fri Apr 04 1997 12:32 | 7 |
| An application that does not fit in the cache frequently runs faster on a
machine without a cache. Each access that misses in the cache must first probe
the cache to discover the miss; these "wasted" cache accesses increase the
effective memory access time and consume 20-30% of the bus bandwidth.
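To put rough numbers on it (these are made up purely for illustration): suppose
a B-cache probe costs 10 cycles and a main-memory access 60 cycles. An
application that misses 90% of the time averages 0.1*10 + 0.9*(10+60) = 64
cycles per access through the cache, versus a flat 60 cycles going straight to
memory - a net loss, even before counting the bus cycles the probes consume.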
--RS
|
673.6 | | DECCXL::OUELLETTE | crunch | Fri Apr 04 1997 13:59 | 10 |
| > (i.e. small) arrays
The benchmark measures memory bandwidth.
You must choose array sizes larger than the largest cache.
That's the deal... otherwise you haven't run McCalpin's benchmark.
Prof. McCalpin's field of study at the U. of Delaware (I think) was reservoir
simulation and weather modeling; he's at SGI now.
None of his data stays in cache long ... that's why he (and people
like him) care ... that's why he wrote the benchmark.
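In case it helps to visualize what is being timed, the heart of the benchmark
is just four loops over big double-precision arrays. A rough C rendition (the
real thing is Fortran, with timing and reporting wrapped around each loop; the
names in quotes match the labels in the output):

  /* The four kernels, roughly as McCalpin defines them.  N must be chosen so
   * that 3 * N * sizeof(double) is well beyond the largest cache, otherwise
   * you are measuring cache bandwidth rather than memory bandwidth. */
  #define N 2000000                 /* three 2M-element double arrays, ~48 mb */
  static double a[N], b[N], c[N];

  void stream_kernels(double scalar)
  {
      long i;
      for (i = 0; i < N; i++) c[i] = a[i];                  /* "Assignment" */
      for (i = 0; i < N; i++) b[i] = scalar * c[i];         /* "Scaling"    */
      for (i = 0; i < N; i++) c[i] = a[i] + b[i];           /* "Summing"    */
      for (i = 0; i < N; i++) a[i] = b[i] + scalar * c[i];  /* "SAXPYing"   */
  }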
|
673.8 | On-chip measurements; b-cache vs. no b-cache | PERFOM::HENNING | | Sun Apr 06 1997 22:47 | 106 |
| [repost with corrected total size calculations]
Yes one can measure bandwidth for on-chip caches and between b-cache and
CPU with a suitably modified McCalpin Streams bmark - you use rpcc and
small array sizes. But trying to get close to the theoretical peak values
is tricky - Bob Nix wrote a memory test a while back in which he made the
comment:
*
* Code quality: These tests are sensitive to the quality of code
* generated by the compiler. The test results also include the loop control
* overhead. This overhead can be factored out of the latency tests by
* simply subtracting off a known latency, the L1 latency on a cache-line
* sized stride, from all results. The bandwidth results can't be adjusted
 * in such a simple way -- these tests simply require a good optimizing compiler.
 * I've successfully compiled the bandwidth tests to run at ~80% of hardware peak on
* Alpha OSF, Alpha NT, HP Snake Unix, IBM Power1 Unix, and Pentium NT; but
* getting that performance required fiddling with source, switches, and careful
* checking against expected results.
*/
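For anyone who wants to try this themselves, here is a minimal sketch of such a
small-array timing loop in C. It is illustrative only: the actual runs were
Fortran, it assumes GCC-style inline asm for rpcc (with DEC C one would
presumably use its asm() intrinsic instead), it hard-codes the clock speed, and
it times just the SAXPY-style kernel.

  #include <stdio.h>

  #define N      250      /* elements per array: 250 * 8 bytes ~ 2 kb each  */
  #define NTIMES 1000     /* repeat so the cycle-count delta is respectable */

  static double a[N], b[N], c[N];

  static unsigned long rpcc(void)          /* read the Alpha cycle counter  */
  {
      unsigned long cc;
      __asm__ __volatile__("rpcc %0" : "=r" (cc));
      return cc & 0xffffffffUL;            /* low 32 bits are the counter   */
  }

  int main(void)
  {
      double scalar = 3.0, mhz = 300.0;    /* clock of the machine under test */
      unsigned long t0, t1, cycles;
      long i, k;

      for (i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

      t0 = rpcc();
      for (k = 0; k < NTIMES; k++)         /* note: an aggressive optimizer */
          for (i = 0; i < N; i++)          /* may collapse this repeat loop */
              c[i] = a[i] + scalar * b[i];
      t1 = rpcc();
      cycles = (t1 - t0) & 0xffffffffUL;   /* counter wraps at 2^32 */

      /* three arrays touched, 8 bytes each, N elements, NTIMES repeats */
      printf("SAXPYing: %.1f MB/s\n",
             24.0 * N * NTIMES / (cycles / (mhz * 1e6)) / 1e6);
      return 0;
  }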
Running an rpcc'd McCalpin bmark with array sizes of 250 (~2k for each of
the three arrays, so all three should fit in the Dcache) just now with
Digital Fortran 77 V4.1-92-33BL on an EV5 @ 300 MHz using
f77 -O5 -tune ev4 gave:
Function Rate (MB/s)
Assignment: 2096.3372
Scaling : 1564.0641
Summing : 2080.6025
SAXPYing : 1848.0018
f77 -O5 -tune ev5 gave:
Assignment: 2296.7518
Scaling : 1871.0299
Summing : 2509.4733
SAXPYing : 2438.1621
and f77 -tune ev5 (i.e. dropping the -O5) gave:
Assignment: 2204.0389
Scaling : 1562.0302
Summing : 2052.1661
SAXPYing : 1682.3546
Taking the best of the above (-O5 -tune ev5) and varying the array size
to 1000 (~8kb each array, ~24kb total, fits in one bank of S-cache) gave:
Assignment: 1714.1651
Scaling : 1749.7468
Summing : 1393.5049
SAXPYing : 1574.6575
Changing the array size to 4000 (which means ~32kb for each of the three
arrays, ~96kb total) gives:
Assignment: 1364.9044
Scaling : 825.7452
Summing : 972.5669
SAXPYing : 1058.3659
which is quite a drop - one suspects that not all 96kb was stored in the
s-cache. Switching to 20,000 elements (~1/2 mb total) drops down to:
Assignment: 410.1500
Scaling : 401.9950
Summing : 418.6805
SAXPYing : 436.3363
Finally, changing the array size to 1,000,000 (~24mb total) drops the
bandwidth down to main memory speed:
Assignment: 98.3563
Scaling : 105.1684
Summing : 108.7296
SAXPYing : 118.0426
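(All the footprints above follow from 3 arrays x N elements x 8 bytes/element:
N=250 -> ~6 kb total, 1,000 -> ~24 kb, 4,000 -> ~96 kb, 20,000 -> ~480 kb,
and 1,000,000 -> ~24 mb.)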
CAUTION #1: The on-chip measurements above are for DIGITAL INTERNAL USE
ONLY. The on-chip figures listed in 528.14 are the right ones to quote
externally, not these, because of their extreme sensitivity to code quality
and the fact that there is no way to ensure a "level playing field" between
one vendor's on-chip figures and another's.
CAUTION #2: You also should not quote the final figures above, the
bandwidth to main memory, because my available EV5 happened to be a
system that does not have stellar main memory bandwidth. If the
customer cares about main memory bandwidth, point out Turbolaser,
Rawhide, or Miata. See published results at
http://www.cs.virginia.edu/stream/standard/Bandwidth.html
Machine ID ncpus COPY SCALE ADD TRIAD
DEC_8400_5-350 1 215.7 207.5 219.6 234.2
DEC_4100_5-400- 1 247.8 243.9 264.9 268.0
DEC_433a-(0MB_L3) 1 292.6 292.6 323.4 341.3
As to B-cache or no B-cache, right, sometimes the B-cache will actually slow
you down. A little note on that subject is at
http://tlg-www.zko.dec.com/~henning/bcache.html
/John Henning
CSD Performance Group
|
673.9 | ..and. | RDGENG::WILLIAMS_A | | Mon Apr 07 1997 10:21 | 3 |
| and the 'on-chip' stuff gets 'wider' with a faster clock, right?
AW
|
673.10 | OK, a more contemporary chip | I4GET::HENNING | | Mon Apr 07 1997 13:11 | 60 |
| Per Adrian's request, a follow-on to .8 with a faster system.
The following data is for Digital Internal Use Only. It uses an Alpha
21164 near or at the MHz limits recently announced (see
http://www.digital.com/PR00SK) which has been incorporated into an as-yet
unannounced system. Actual mileage may vary. The precise system that I
am using may never be announced; product definition is subject to change.
Insert additional qualifiers here.
Anyway, with array size 250, Fortran 77 V4.1-92-33BL, and f77 -O5 -tune ev5,
yes, the achieved on-chip bandwidth to the Dcache goes up substantially:
Function Rate (MB/s)
Assignment: 4579.7506
Scaling : 3736.6764
Summing : 5003.9197
SAXPYing : 4848.6024
And to the S-cache (Array size 1000)
Assignment: 3419.2856
Scaling : 3491.5592
Summing : 2779.2028
SAXPYing : 3140.5720
Array size 4000
Assignment: 2722.6025
Scaling : 1814.7246
Summing : 2047.1877
SAXPYing : 2201.5637
Here's an array size big enough to hit the board cache (20000) - note that
the system under test here has more than double the bandwidth to the cache of
the system in .8:
Assignment: 928.6953
Scaling : 925.8906
Summing : 975.0332
SAXPYing : 1016.0961
And in fact this bandwidth to the (8mb) cache holds up fairly well even as
the array sizes are increased to 300,000 elements (~2.4mb each, ~7.2mb total):
Assignment: 890.7892
Scaling : 740.7456
Summing : 998.8359
SAXPYing : 918.9741
Finally, here's the main memory speed, with an array size of 3,000,000
elements (~72mb total):
Assignment: 266.0516
Scaling : 251.8598
Summing : 271.3901
SAXPYing : 277.2678
CAUTION: these numbers are for Digital Internal Use Only, as explained
in reply .8.
|