T.R | Title | User | Personal Name | Date | Lines |
---|
1710.1 | My shot. | CADSYS::COOPER | Topher Cooper | Wed Jan 13 1993 15:46 | 40 |
| Well I'm not sure I understand the problem, but I'll take a shot.
If you can assume that some number of samples are drawn from the same
distribution (i.e., same mean and variance) then you can calculate the
variance of each sample from the variance of the means around their
common mean.
I suspect that you are looking for the variance of each sample,
however, and there really isn't enough information for this.
If I understand what you are telling us, you have a bunch of
transmission times and a bunch of reception times but no way of
identifying corresponding times. Now imagine the following sample:
You have transmission times of 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10.
You have reception times of 60, 61, 62, 63, 64, 65, 66, 67, 68, 69 and
70. The mean time is 60.
Now, if the channel is FIFO, so that the 0 tranmission corresponds to
the 60 reception, the 1 transmission corresponds to the 61 reception,
and so on. Each transmission takes exactly 60 (I'm ignoring rounding
effects) so the variance is 0.
But, if the channel is LIFO, so the correspondences are 0:70, 1:69,
2:68, etc. then the variance works out to 44. Which is considerably
different from the FIFO case.
Since the data really is ambiguous no computational trick will allow us
to pull an unambiguous variance out of the hat. These two cases do,
however, provide upper and lower bounds which might or might not be
good enough for your purposes. The physical situation is such that
the lower bound is likely to be much closer to the true value than the
upper bound.
Perhaps you have enough knowledge of the process (e.g., an estimate
of how often objects cross in transmission. Then we could do better.
What have I totally misunderstood?
Topher
|
1710.2 | Bullseye ! | MARVA2::RAK | | Wed Jan 13 1993 17:31 | 43 |
| You understood the problem exactly !
And, you've taught me how we can determine upper and lower
bounds for the variance. I'll present it to our customer.
Any comments on the following approach? It doesn't
fully comply with our requirements.
I don't have any ideas (yet) on how to squeeze out
more statistics out than you've outlined while staying strictly
within the bounds of the problem. BUT, I have an idea that
may be useful if some of the "other information" associated
with each object is invariant during the object's life
and allows discrimination between classes of objects.
That is, we don't have a full object-id but we can obtain
several "classes" of objects.
EXAMPLE:
Imagine a sample with 6 objects that can be placed
into 3 classes (c1, c2, c3).
Mean travel times can be computed by groupings
of classes for each link/sublink.
Group: g1 g2 g3 g4 g5 g6 g7
Classes: c1 c2 c3 c1+c2 c1+c3 c2+c3 c1+c2+c3
X value: 2 2 2 1 1 1 0
x = number of classes in final group (3) -
number of classes in current group
y = absolute value of [current group mean - final group mean)]
The shape and slope of the curve is related to the variance.
(I expect the curve will asymptotically approach some value
that can be correlated to the true variance.)
If all y values are 0, the variance is (intuitively) lower than
cases where the y value increases wildly with each smaller
subset of the entire sample.
I don't know the best way to proceed mathematically.
Once could find a function that best describes the
relationship between X and Y and extrapolate to the
x value where each class consists of a single object.
|
1710.3 | Finding error in aproximate variance | MARVA1::RAK | | Fri Jan 15 1993 09:34 | 24 |
| Posted for a fellow employee who can't get to Notes now:
I need to establish the bounds of the error in approximating
sample variances. Again, I have two sets of transactions,
one consisting of true departure times and the other consisting of
true arrival times. But there is no way to associate the departure
transaction of an object with the arrival transaction of the
same object.
I can calculate the EXACT mean of the sample. From the sample
mean, I want to approximate the sample variance by assuming
all objects depart at the same time. For this, I choose the
median of the true departure times. What is the approximation
error as a percentage of the true variance? For concreteness,
use two hours as the range of the true departure times, five days
as the sample mean, and normal distribution for the arrival times.
A general formula will be appreciated.
A related question is how to recalculate the error when combining
samples. Will it stay the same or get worse?
This is not a hypothetical situation. My client's existing
equipment can't identify individual objects in a sample unless
they spend a fortune upgrading it.
|
1710.4 | First stab part I. | CADSYS::COOPER | Topher Cooper | Fri Jan 15 1993 18:16 | 81 |
| RE: .3
OK. Let's give this a shot. Right off I should say that I don't think
that pretending that all the departures are at the median departure is
a very good choice of estimates unless the arival times are *much* more
spread out than the departure times, in which case it doesn't matter
much what model you use.
The bottom line is that there is no firm answer to this. To know what
the "true" error is I would have to have a model of the transport
process. One such (not totally unreasonable) model is that the
transport process acts like a FIFO queue: the objects are transported
on a strictly first come first serve basis. Another (probably not
particularly reasonable) model is that the transport process acts like
a LIFO queue: the objects are collected somewhere and then sent out in
reverse order of arival with order strictly maintained before and after
the storage.
As I said before these form bounding cases: upper and lower bounds
for the true sample variance. There are also some other models we
can look at. But first lets develop some formulas which will help
us compare (warning I'm basically doing this as I write -- it is
thus very much subject to dumb algebraic mistakes, check carefully
before using).
Assume, for the outset, that we do know what arrival times go with
which departure times for our n objects. Arranged in no particular
order we can call the departure time for the i'th object d[i] and its
arrival time a[i]. The transport time t[i] = a[i] - d[i]. First
off we want the mean transport time, T. We find that T = A - D
where A is the mean arrival time and D is the mean departure time.
This does not depend on which a goes with which d so this can be
calculated in a model independent way.
Now we are interested in the sample variance. Let V[t] be the variance
of the transport times, V[a] be the variance of the arrival times
and V[d] be the variance of the departure times. Then
V[t] = V[d] + V[a] - 2*{sum(a[i]*d[i]) - n*A*D}/(n-1)
We can split this into a model independent part:
Vi[t] = V[d] + V[a] + 2*n*A*D/(n-1)
and a model dependent part:
Vd[t] = 2*sum(a[i]*d[i])/(n-1)
V[t] = Vi[t] - Vd[t]
Let's look at a variation of the compuational method you used. Its
the same except we will use the mean, D, rather than the median of the
departure times. We can use the same formula as above except all the
d[i]s = D, and V[d] = 0. We get:
V'[t] = 0 + V[a] + 2*n*A*D/(n-1) - 2*sum(a[i]*D)/(n-1)
= V[a] + 2*n*A*D/(n-1) - 2*D*sum(a[i])/(n-1)
= V[a] + 2*n*A*D/(n-1) - 2*D*n*A/(n-1)
= V[a]
This isn't too surprising result, basically if we assume that there is
no variation in the departure time, all the variance of the arrival
time is due to the variance of the transportation.
If we use a constant value other than D, say D+O, then we get
V'[t] = V[a] + 2AO/(n-1)
we get some additional variance. This simply reflects that we are, in
a sense taking a variance around something other than the mean, and the
mean (generally refered to as simply the variance) is the point which
produces the minimal variance. So using the median will add a factor
but it doesn't really mean much. It might actually improve the
estimate, but only because it adds back in a bit which may compensate
for ignoring the variance of d (in fact, the expected O for the median
will be, as I remember, proportional to V[d] with a proportionality
which depends on the distribution of d, and a bias which depends on the
skew of the distribution of d).
Time to go home. More later.
Topher
|
1710.5 | Improved formula | CADSYS::COOPER | Topher Cooper | Mon Jan 18 1993 14:36 | 61 |
| RE: .4
Driving home from work Friday, I thought of an improvement in the
equation I posted in .4.
Start with the same formula for the variance of the transport times:
V[t] = V[d] + V[a] - 2*{sum(a[i]*d[i]) - n*A*D}/(n-1)
Define quantities aa[i] and dd[i] such that:
a[i] = A + aa[i]
d[i] = D + dd[i]
That is, aa and dd are the offsets from the mean arrival and departure
times respectively. If we substitute these for the part of the
equation above inside the {} we get:
... sum(a[i]*d[i]) - n*A*D ... =
... sum((A+aa[i])*(D+dd[i]) - n*A*D ... =
... sum(A*D) + sum(A*dd[i]) + sum(D*aa[i]) + sum(aa[i]*dd[i]) -
n*A*D ... =
... n*A*D + A*sum(dd[i]) + D*sum(aa[i]) + sum(aa[i]*dd[i]) -
n*A*D ...
But sum(dd[i]) = sum(aa[i]) = 0, so this comes to:
... sum(aa[i]*dd[i]) ...
Now if we split the equation into model independent and dependent parts
we get the simpler:
Vi[t] = V[d] + V[a]
Vd[t] = 2*sum(aa[i]*dd[i])/(n-1)
V[t] = Vi[t] - Vd[t]
This form makes it easier to see mathematically, why the FIFO and LIFO
cases are lower and upper bounds in the variance. Think of each offset
(aa[i] and dd[i]) as cosisting of a sign (+/-) and a magnitude. Take
the magnitudes of the aa's as weights to determine how much to "count"
the value of each dd magnitude in the sum. Both FIFO and LIFO put the
maximum weights on the large magnitude values. The big values get
counted the most, while the small weights are put on the small values.
You get the same effect, of course, if you make dd the weights and
aa the values to be weighted.
In the FIFO case, you get the signs matching: + with + and - with -.
(If the number of -'s aren't the same, the mismatch, where + goes with
-, will be in the "middle" which is where the weights are low and
there is therefore little effect). The products will therefore all
be positive (except possibly on the small values, as I said) and the
sum will be positive. Since the sum is subtracted from the model
indendent part, this will produce a minimum possible value.
In the LIFO case, the signs will criss-cross: + with - and - with +.
The products will all (except, possibally for some small values) be
negative and the sum will be negative. This negative sum will be
subtracted -- which means we are "really" adding -- and so a maximum
variance will be produced.
Topher
|
1710.6 | First In Random Out | CADSYS::COOPER | Topher Cooper | Mon Jan 18 1993 16:20 | 60 |
| Now I'm going to discuss another possible model for transport.
The assumption made in this model is that the transport mechanism
essentially ignores the time or order in which the departures are
made. This might be the case if, for example, the objects sent out
in any particular observation window are stored and sent out all
at once.
Another way of viewing this particular model is that it is an "average"
model: it represents the "average" behavior of all the different
possible models, weighing each one equally.
The expected variance is the average variance for each possible way
that input objects might be associated with output objects. There
are n! such combinations. In finding the average of these variances,
we add up the V[t]'s for each model and divide by n!. The model
independent part is going to be the same each time, so it will be
added n! times then the result will be divided by n! so we get:
V[t] = Vi[t] - sum-over-models(Vd[t])/n!
Let dd[j,k] be the dd which matches aa[j] in model k, the second part
of that becomes:
n! n
sum ( 2*sum (aa[j]*dd[j,k]) /(n-1) ) / n! =
k=1 j=1
2 n! n
-------- * sum ( sum (aa[j]*dd[j,k]) ) =
(n-1)*n! k=1 j=1
2 n n!
-------- * sum ( sum (aa[j]*dd[j,k]) ) =
(n-1)*n! j=1 k=1
2 n n!
-------- * sum ( aa[j] * sum (dd[j,k]) )
(n-1)*n! j=1 k=1
In the inner sum of this last formula, each of the n dd's appear an
equal ( (n-1)! ) times, so this becomes:
2 n n
-------- * sum ( aa[j] * (n-1)! * sum (dd[i]) )
(n-1)*n! j=1 i=1
But that new inner sum is equal to zero, so that full product is zero
and the sum is zero, so we end up with:
V[t] = V[d] + V[a]
So, in the average, the model dependent parts cancel out and we are
left with the sum of the variances of the arival times and departure
times.
There is one more model I can think of to go -- and it may be the most
useful as a point estimate of the variance.
Topher
|
1710.7 | Final Model. | CADSYS::COOPER | Topher Cooper | Sun Jan 24 1993 11:37 | 113 |
| Sorry it took me so long to get to the last part of this -- I got a bit
backed up.
I think that this model is your best shot at a single "point-estimate"
for the variance. In this case I am going to approach the problem a
little differently -- making different kinds of assumptions.
First off, assume that each departure occurs independently of the
others during the departure period. That is, for each object, there is
a particular probability that it will be transmitted during any
particular sub-segment of the departure period. The function which
determines the probability for each moment (very short sub-segment) of
the departure period is the "distribution" of departure times. We are
assuming that the departure distribution is the same for all the
objects and that it is not influenced by how many, or which, or when
any other objects were transmitted. The variance is a characteristic
of the distribution. When you calculate the variance of a bunch of
departure times what you are doing is estimating the true variance
(generally called the "population" variance statistical jargon).
For purposes of algebraic manipulation the departure time for an object
is assigned a variable. This is a special variable that does not
represent the single, unknown time of departure for some particular
object. Rather it is a "random variable" which contains information
about the entire distribution of the objects' departures. The
non-random variables for the departures are what I previously called
"d[i]" for the i'th object. To distinguish I'll call the *random*
variable "D[i]".
The second assumption is that the transportation time (random variable
T[i]) for an object is independent of the departure time for that
object and independent unaffected by the transmissions of all the other
objects. That is probably not "realistic" -- it means that there is
no waiting at any point in the transportation while the mechanism
"finishes" with another object, no rerouting depending on traffic,
etc. But I think that it may well be a not too unreasonable
approximation, in the sense that the inaccuracies are small in terms of
behavior and probably cancel each other out over time.
Let's look at the behavior of a single object with single (though
unspecified) departure and transportation times. The time of the
arrival will be the time of the departure plus the time of the
transmission:
a[i] = d[i] + t[i]
(Note that I am not using random variables in the above). That means
that we can say the same about the random variables:
A[i] = D[i] + T[i]
The addition of random variables produces a random variable as well,
but the process is not a simple addition of single values. The
probability for a time in the "middle" of the distribution associated
with A[i] will be due to many combinations like an early departure but
a long transmission time, or vice versa. (In fact, one of the most
important theorems in statistics says that if you add up enough of
almost *any* random variables the result will have a distribution very
close the the normal distribution -- this is one of the reasons that
the normal distribution is used so much -- but that doesn't help us
with just adding to random variables unless we know that they are both
very close to the normal distribution).
What we want to know, however, is the variances of the random
variables. As it happens, though getting the distribution of A[i] from
D[i] and T[i] is not simple, getting the variance is straight forward.
Regardless of the distributions involved (except that they must have
variances, which is not, in theory always the case for "infinite"
distributions -- which we need not worry about),
The variance of a sum of random variables is equal to the sum
of their variances.
So:
V[A] = V[D] + V[T]
(this is one good reason to work with variances rather than standard
deviations). This also bears on one of your questions:
>A related question is how to recalculate the error when combining
>samples. Will it stay the same or get worse?
We don't need to know V[A], though, we can estimate it, and V[D] from
our observables. What we want is V[T]. So:
V[T] = V[A] - V[D]
And that is the result we are looking for.
It is interesting to note, by our previous formula that this means that
on the average, under these assumptions that:
sum(aa[i]*dd[i])/(n-1) = V[D]
Which is a surprising and unexpected result. Kinda' neat.
One more note:
You may find that if you look at enough samples, you will find that
2
V[D] = W[D] /12
Where W[D] is the "width" of the departure period, i.e., the amount of
time during which transmissions are made. That probably means that
the departure times are pretty close to being distributed uniformly,
i.e., equally likely to occur any time during the transmission period.
In that case, you might find it more convenient (though a tad less
accurate, I think) to just use the above formula, instead of the
measured V[D].
Topher
|
1710.8 | Proof of previous result. | CADSYS::COOPER | Topher Cooper | Mon Jan 25 1993 15:29 | 87 |
| RE: .7 (me)
In the last note I found that under the assumptions of the model
described there that:
> It is interesting to note, by our previous formula that this means that
> on the average, under these assumptions that:
>
> sum(aa[i]*dd[i])/(n-1) = V[D]
This was found by comparing the general formula for variance of
transport times, in terms of matching up departures and arrivals, with
the result of the transport variance from the addition of variances
formula. While this result seemed surprising enough to me for me to
describe it as "kinda' neat", it was also surprising enough to me to
worry that I had made a mistake somewhere. So here is a derivation
of the above relationship directly in terms of our "matching up" view
of things.
Start with the formula from .7.
a[i] = d[i] + t[i]
Since means are linear (the mean of a sum equals the sum of the means),
we also have:
A = D + T
where A, D and T are the mean arrival, departure and transport time.
Also remember that:
a[i] = A + aa[i]
d[i] = D + dd[i]
and by extension:
t[i] = T + tt[i]
So the first formula above becomes, by substitution:
A + aa[i] = D + dd[i] + T + tt[i] ->
D + T + aa[i] = D + dd[i] + T + tt[i] ->
aa[i] = dd[i] + tt[i]
so:
sum( aa[i] * dd[i] )/(n-1) =
sum( (dd[i]+tt[i])*dd[i] )/(n-1) =
(sum(dd[i]�) + sum(dd[i]*tt[i]))/(n-1) =
sum(dd[i]�)/(n-1) + sum(dd[i]*tt[i])/(n-1)
But the expected value of the second summation in that last line is 0
(which I'll demonstrate in a moment). Accepting that for the moment, I
can continue:
= sum(dd[i]�)/(n-1) =
sum( (d[i] - D)� ) / (n-1) = (by definition)
V[D]
So why is the second summation 0? The fast answer I can give is that
dd[i] and tt[i] are independent of each other and in probability theory
that the expected value (average value) of such a cross product is 0
is the definition of independence. But that is really cheating, since
that definition is rather arbitrary until you see why the intuitive
concept of independence would lead to that mathematical property.
For simplicity, lets assume that there are a finite number of possible
dd and a finite number of possible tt values. If I take a large number
of pairs and just look at those pairs which are associated with any
particular dd value (say dd[(j)]) the frequency of each of the tt
values is the same as for any other dd. This is because the
probability that for any particular pair a particular tt comes up does
not depend (i.e., is independent) on which dd is in the pair. So if we
add up a large number of pairs, and group the sum by dd[(j)]'s each
dd[(j)] will be multiplied by a sum of tt[(k)]'s. But the mean value
of the tt[(k)]'s is zero. So on the average, each tt is multiplied by
zero, so the expectated value of:
dd[i]*tt[i]
is zero, and a sum of such values is zero, so:
E[sum(dd[i]*tt[i])/(n-1)] = 0
where E[xxx] is the expected (average) value of xxx.
Topher
|