T.R | Title | User | Personal Name | Date | Lines |
--------------------------------------------------------------------------------
596.1 | | BEING::POSTPISCHIL | Always mount a scratch monkey. | Thu Oct 16 1986 18:23 | 11 |
| In spite of the formula, the geometric means seem to have been
calculated correctly.
I do not believe the error in the comparisons lies with the arithmetic
mean, but with the "normalization". The "normalization" the author
uses is basically a way of assigning values to the various results, and
it assigns the highest values to the benchmarks the chosen system did
best in.
-- edp
|
596.2 | Known problem | SQM::HALLYB | Free the quarks! | Fri Oct 17 1986 13:26 | 38 |
| If you look at the original data and compute the product of all
3 "times" you get 24000 for all 4 systems. Not at all surprising
that the (correctly-calculated) geometric mean is the same in all
cases.
/* This kind of technique has been bouncing around the performance
community for quite some time now. The main problem is that there
is no easy way to compare different CPUs on the basis of some number
of benchmarks. */
The need for "normalization" comes from a problem with the arithmetic
mean. If you run benchmarks x and y on systems P and Q, and get
times that look like:
              P        Q
    x     100.0     10.0
    y       0.1      1.0
you can see that benchmark x really dominates the whole set, and
the contribution of y is irrelevant. So in order to give equal
weight to x and y, you normalize with respect to one of the CPUs,
and end up with:
              P        Q
    x       1.0      0.1
    y       1.0     10.0
Then the arithmetic mean for P is 1, while the mean for Q is
(0.1 + 10.0)/2 = 5.05. Of course, if you normalize with respect to Q
instead, the situation is reversed.
It is not unusual to see this kind of raw data, owing to various
compiler optimizations.
The geometric mean is independent of normalization since it is already
a "multiplicative entity": normalizing divides every time for a given
benchmark by the same constant, which scales every system's geometric
mean by the same factor and leaves the comparison unchanged. I believe
the original article intended to point out that the arithmetic mean is
a poor way to compare CPU times, and that the geometric mean is more
useful.
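A quick sketch of this in Python (the times are the hypothetical P/Q
numbers above, not real data):

    # Arithmetic vs. geometric mean under normalization.
    from math import prod

    times = {"P": {"x": 100.0, "y": 0.1},
             "Q": {"x": 10.0,  "y": 1.0}}

    def normalize(times, ref):
        # Divide each system's time by the reference system's time
        # for the same benchmark.
        return {sys: {b: t / times[ref][b] for b, t in bench.items()}
                for sys, bench in times.items()}

    def amean(v):
        return sum(v) / len(v)

    def gmean(v):
        return prod(v) ** (1.0 / len(v))

    for ref in ("P", "Q"):
        n = normalize(times, ref)
        print(f"normalized to {ref}:")
        for sys, bench in n.items():
            vals = list(bench.values())
            print(f"  {sys}: arithmetic {amean(vals):.2f}"
                  f"  geometric {gmean(vals):.2f}")

Whichever machine you normalize to comes out ahead on the arithmetic
mean, while the geometric means never move.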
John
|
596.3 | | BEING::POSTPISCHIL | Always mount a scratch monkey. | Fri Oct 17 1986 21:03 | 15 |
| Re .2:
If normalization is necessary, you certainly don't do it by reducing
the effect of the bad or dominating benchmarks!
A better way to do it is to figure out how relevant the various
benchmarks are for your system. For example, you might figure that
sixty percent of your work will be somewhat like benchmark x and forty
percent will be like benchmark y. Use those figures to adjust the
data. If that still leaves one benchmark dominating the other,
that is good, because, when you buy the systems, that portion of the
work will dominate the rest.
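A sketch of that adjustment in Python (the 60/40 split is the
hypothetical mix above, applied to the times from .2):

    # Weight each benchmark by the fraction of your workload it
    # represents, then compare the weighted times directly.
    times = {"P": {"x": 100.0, "y": 0.1},
             "Q": {"x": 10.0,  "y": 1.0}}
    weights = {"x": 0.6, "y": 0.4}   # hypothetical workload mix

    for sys, bench in times.items():
        weighted = sum(weights[b] * t for b, t in bench.items())
        print(f"{sys}: weighted time {weighted:.2f}")

Benchmark x still dominates the result, but now it dominates because
it is 60% of the workload, which is exactly what you want.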
-- edp
|
596.4 | Up to 4.2 times more useful | SQM::HALLYB | Free the quarks! | Sat Oct 18 1986 01:51 | 29 |
| Re .3:
Yes, that is the standard suggestion made at this point in the
argument. Unfortunately, here we tend to stray from the MATH content
and enter an unrelated topic. So to keep it brief,
suffice it to say that the benchmarks x, y, z, ... have little if
anything to do with actual workloads. They're just a random bunch
of programs that get passed on from one young generation to another.
Occasionally somebody will add in a program so as to contribute
to the sum total of Human Knowledge, but almost invariably the programs
added have the characteristic of being fairly easy to code and most
importantly very easy to run. Hence they tend to do either no IO
at all or IO exclusively; rarely indeed is any attempt made to
actually model a workload, and even then it's a general workload,
not anything site-specific. Some exceptions exist.
The next question usually is along the lines of "Well, why run all
these silly little benchmarks if they don't mean anything?" There
isn't much of an answer to this, except that these little programs
are about the only way to make any kind of comparison across a
wide variety of processors for a wide variety of customers, and
even if the data is only vaguely useful, it's better than comparing
raw instruction timings and IO bus bandwidths. Certainly better
approaches exist but they involve a LOT of work to instrument an
existing workload and then generate a synthetic workload to duplicate
the observed one. Most customers can't afford to do that, and at
times the workload to be predicted doesn't yet exist.
John
|
596.5 | yes | TOOK::APPELLOF | Carl J. Appellof | Mon Oct 20 1986 14:01 | 9 |
| I agree that the problem is in the "normalization".
There are really two components to this: the first, as pointed
out, is weighting the benchmarks according to how important they
are to YOUR workload. The second, and the only mathematical one,
is to reduce the results of the various benchmarks to some common
scale so that an arithmetic mean can make sense.
Obviously, standardizing each benchmark against one of the machines
under comparison is not the way to do it.
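One way to do it instead (a sketch; the symmetric reference is my
suggestion, not from the article): scale each benchmark by a reference
that treats all machines alike, such as that benchmark's geometric
mean across the systems, before weighting and averaging:

    from math import prod

    times = {"P": {"x": 100.0, "y": 0.1},
             "Q": {"x": 10.0,  "y": 1.0}}
    weights = {"x": 0.6, "y": 0.4}   # hypothetical workload mix

    systems = list(times)
    benchmarks = list(weights)

    # Per-benchmark reference scale: the geometric mean over all
    # systems, so no single machine is privileged.
    ref = {b: prod(times[s][b] for s in systems) ** (1.0 / len(systems))
           for b in benchmarks}

    for s in systems:
        score = sum(weights[b] * times[s][b] / ref[b] for b in benchmarks)
        print(f"{s}: weighted common-scale score {score:.2f} (lower is better)")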
|
596.6 | Doug Clark gave an excellent lecture on a related topic | EAGLE1::BEST | R D Best, Systems architecture, I/O | Sun Oct 26 1986 01:55 | 16 |
|
In case anyone is interested, Doug Clark gave a very interesting
(and amusing) talk on the rampant misuse of benchmarks about a year ago
at an LTN technical forum. I believe it was entitled something like
'Ten Awful Ways to Measure Computer Performance'. He discusses the
effects of neglecting realistic cache hit ratios, compiler effects,
why certain commonly used benchmarks are notoriously bad indicators of
real-life computer usage, AND the specious use of statistics and math
by hardware manufacturers (including us), and other 'tricks of the
trade'. I believe it was recorded and should be available on videotape
from the LTN library. I can almost guarantee that this talk will have
you rolling on the floor. I give it my vote for one of the all-time
best lectures I've attended.
/R Best
|
596.7 | Median or mean | AIWEST::DRAKE | Dave (Diskcrash) Drake 619-292-1818 | Sun Oct 26 1986 03:44 | 22 |
| A few thoughts:
Re .0: The arithmetic mean is not usually also the median. The median
is the value that has 50% of the observations above it and 50% below.
In fact, I have found that median-based figures of merit are very
useful in a wide class of analysis problems. I have used them in
image processing to "cast out" bad data rather than forming linear
filters that include it. The median would in fact be a good comparison
mechanism, as it would help ignore benchmark extrema.
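A tiny sketch of the idea in Python (the ratios are invented):

    # A median-based figure of merit ignores extrema that would drag
    # an arithmetic mean around.
    from statistics import mean, median

    ratios = [0.9, 1.0, 1.1, 1.2, 9.0]   # one wild outlier

    print(f"arithmetic mean: {mean(ratios):.2f}")   # 2.64, pulled up by the 9.0
    print(f"median:          {median(ratios):.2f}") # 1.10, ignores the extremum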
No question, benchmarks are a pit. We try to quantify some simple
"figure of merit" about a very complex system such as an 8800. I
would think it would be better to distill each processor into
its component queueing mechanisms and provide quantitative data
about the service time of each queue. (A queue in this case means
any system resource that is consumed in common by processes.) Each
processor would end up with, say, 5 to 10 values that would be used
for comparison purposes. Someone would probably come along, take the
norm of the 5-to-10-valued vector, and call this the "performance"
(sketched below). If we did this we could more accurately compare
new applications against our systems. All I can say is MIPS are
for DIPS.
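As a sketch of the vector idea (the queue counts and service times
here are invented):

    # Characterize each processor by the service times of its
    # component queues, then reduce the vector to one number with a
    # Euclidean norm, as predicted above.
    from math import sqrt

    service_times = {"A": [0.5, 2.0, 1.2, 0.8, 3.0],   # ms per queue
                     "B": [0.7, 1.5, 1.0, 1.1, 2.5]}

    for sys, v in service_times.items():
        norm = sqrt(sum(t * t for t in v))
        print(f'{sys}: service times {v} -> "performance" {norm:.2f}')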
|