> Q: Does this imply that a 21164 running at the same clock speed could be
> up to 33% faster than a 21064A?
Well, I don't know whether it implies it or not. The fact is that for some
programs a 21164 is twice as fast as a 21064A at the same clock speed, since
it can issue twice as many floating-point operations per cycle.
You've ignored another enormous difference between the processors: the size of
on-chip caches. The 21064A has 16KB I-cache & 16KB D-cache. The 21164 has
8KB I-cache and 8KB D-cache, plus a 96KB on-chip level-2 cache. For applications
that benefit from the level-2 cache, the 21164 can make a very big difference.
For SPECint, 21164-based machines tend to perform a bit less than twice
as fast as 21064A-based machines at the same clock rate.
For SPECfp, 21164-based machines tend to perform a bit more than twice as
fast as 21064A-based machines at the same clock rate.
Most of the time, for most applications, even a 21164 is not issuing many
instructions per cycle. In fact, for many important applications, it averages
well under one per cycle. This is mainly because it spends its time waiting
for memory. (That's one reason the on-chip L2 cache helps, but it's not enough.)
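To see how memory stalls can pull the average below one instruction per cycle, here is a minimal sketch of the usual stall-cycle model. The function name and all three input numbers are hypothetical, chosen only for illustration; they are not measurements of a 21164.

```python
def effective_ipc(base_cpi, misses_per_instr, miss_penalty_cycles):
    """Average instructions per cycle once memory stalls are included.

    base_cpi            -- cycles per instruction when every access hits in cache
    misses_per_instr    -- cache misses per instruction (hypothetical input)
    miss_penalty_cycles -- stall cycles per miss (hypothetical input)
    """
    cpi = base_cpi + misses_per_instr * miss_penalty_cycles
    return 1.0 / cpi


# Illustrative numbers only: a core that could sustain 2 instructions per
# cycle on cache hits (base CPI 0.5), with 2 misses per 100 instructions
# and a 50-cycle miss penalty.
print(round(effective_ipc(0.5, 0.02, 50), 2))
```

With these made-up inputs the stall term (1.0 cycle per instruction) is twice the base term, so the memory system rather than the issue width sets the result.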
The 21264 attacks this in two ways. It provides much faster access to off-chip
cache and main memory, and it allows instructions to issue out-of-order, even
if earlier instructions are still waiting for their inputs to become available.
(The latter technique is the reason Pentium Pro is twice as fast as Pentium at
the same clock rate for many applications.)
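A toy model may help show why out-of-order issue matters when loads miss. This is only a sketch: the four-instruction program, the 20-cycle load latency, and the single-issue width are assumptions for illustration, and it models neither the 21264 nor the Pentium Pro.

```python
def run(program, out_of_order, width=1):
    """Return the cycle at which the last result becomes available.

    program is a list of (latency, deps) tuples, where deps lists the
    indices of earlier instructions whose results are needed as inputs.
    """
    ready_at = {}                      # index -> cycle its result is available
    pending = list(range(len(program)))
    cycle = 0
    while pending:
        issued = 0
        for i in list(pending):
            latency, deps = program[i]
            inputs_ready = all(d in ready_at and ready_at[d] <= cycle for d in deps)
            if inputs_ready and issued < width:
                ready_at[i] = cycle + latency
                pending.remove(i)
                issued += 1
            elif not out_of_order:
                break                  # in-order: a stalled instruction blocks the rest
        cycle += 1
    return max(ready_at.values())


# Two independent load/use pairs; each load is assumed to take 20 cycles.
program = [
    (20, []),    # 0: load A
    (1, [0]),    # 1: add using A
    (20, []),    # 2: load B (independent of A)
    (1, [2]),    # 3: add using B
]
print(run(program, out_of_order=False))  # in-order issue
print(run(program, out_of_order=True))   # out-of-order issue
```

In this toy the in-order machine finishes at cycle 42 and the out-of-order one at cycle 22, because the second load is started while the first add is still waiting for its input.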
As a rough rule of thumb, expect a 21264 to have twice the performance of a
21164 at the same clock rate.
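The rules of thumb above can be chained into a back-of-the-envelope estimate. The sketch below assumes performance scales linearly with clock rate, which is optimistic given the memory stalls mentioned earlier. The table, the function name, and the clock rates in the example are hypothetical.

```python
# Rough same-clock performance relative to a 21064A, from the rules of
# thumb above: 21164 is about 2x a 21064A, 21264 is about 2x a 21164.
SAME_CLOCK_FACTOR = {"21064A": 1.0, "21164": 2.0, "21264": 4.0}


def estimated_speedup(new_chip, new_mhz, old_chip, old_mhz):
    """Very rough speedup estimate; assumes linear scaling with clock rate."""
    per_clock = SAME_CLOCK_FACTOR[new_chip] / SAME_CLOCK_FACTOR[old_chip]
    return per_clock * (new_mhz / old_mhz)


# Example with made-up clock rates: same clock, then a doubled clock.
print(estimated_speedup("21264", 500, "21164", 500))
print(estimated_speedup("21264", 600, "21064A", 300))
```

Treat the output as an order-of-magnitude guide only; real applications vary widely around these factors.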
For lots of details on the 21264, see the presentation from last October's
Microprocessor Forum:
http://www.digital.com/semiconductor/a264up1/index.html
To answer the questions I think I saw, the 21264 can
- fetch 4 instructions per cycle
- issue 4 integer ops (including two loads and/or stores)
  plus a floating add, sub, div, or sqrt, plus a floating mul
  (total of 6 instructions) per cycle
- retire up to 11 instructions per cycle
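The arithmetic behind those per-cycle figures can be written out as follows. The constant and function names are mine, and the sustained bound reflects my reading of the numbers (every issued instruction must be fetched first), not a statement from the presentation.

```python
# Per-cycle limits quoted in the list above.
FETCH_PER_CYCLE = 4
INT_ISSUE_PER_CYCLE = 4     # includes up to two loads and/or stores
FP_ISSUE_PER_CYCLE = 2      # one add/sub/div/sqrt plus one mul
RETIRE_PER_CYCLE = 11

# Peak issue rate in a single cycle.
peak_issue = INT_ISSUE_PER_CYCLE + FP_ISSUE_PER_CYCLE


def sustained_ipc_bound():
    """Upper bound on long-run instructions per cycle.

    Issue and retire can briefly exceed the fetch rate by draining
    already-queued instructions, but over a long run the narrowest
    stage limits throughput.
    """
    return min(FETCH_PER_CYCLE, peak_issue, RETIRE_PER_CYCLE)


print(peak_issue)
print(sustained_ipc_bound())
```

So the 6-wide issue and 11-wide retire are burst rates; over many cycles the 4-wide fetch is the ceiling in this simple view.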
Floating latency is 4 cycles (like 21164), load latency for D-cache hits
is 3 cycles (like 21064). On-chip caches are a 64KB I-cache and a 64KB D-cache,
both two-way set-associative.