| T.R | Title | User | Personal Name
 | Date | Lines | 
|---|
| 1710.1 | My shot. | CADSYS::COOPER | Topher Cooper | Wed Jan 13 1993 15:46 | 40 | 
|  |     Well I'm not sure I understand the problem, but I'll take a shot.
    If you can assume that some number of samples are drawn from the same
    distribution (i.e., same mean and variance) then you can calculate the
    variance of each sample from the variance of the means around their
    common mean.
    I suspect that you are looking for the variance of each sample,
    however, and there really isn't enough information for this.
    If I understand what you are telling us, you have a bunch of
    transmission times and a bunch of reception times but no way of
    identifying corresponding times.  Now imagine the following sample:
    You have transmission times of 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10.
    You have reception times of 60, 61, 62, 63, 64, 65, 66, 67, 68, 69 and
    70.  The mean time is 60.
    Now, if the channel is FIFO, so that the 0 tranmission corresponds to
    the 60 reception, the 1 transmission corresponds to the 61 reception,
    and so on.  Each transmission takes exactly 60 (I'm ignoring rounding
    effects) so the variance is 0.
    But, if the channel is LIFO, so the correspondences are 0:70, 1:69,
    2:68, etc. then the variance works out to 44.  Which is considerably
    different from the FIFO case.
    Since the data really is ambiguous no computational trick will allow us
    to pull an unambiguous variance out of the hat.  These two cases do,
    however, provide upper and lower bounds which might or might not be
    good enough for your purposes.  The physical situation is such that
    the lower bound is likely to be much closer to the true value than the
    upper bound.
    Perhaps you have enough knowledge of the process (e.g., an estimate
    of how often objects cross in transmission.  Then we could do better.
    What have I totally misunderstood?
					Topher
 | 
| 1710.2 | Bullseye ! | MARVA2::RAK |  | Wed Jan 13 1993 17:31 | 43 | 
|  |     You understood the problem exactly !
    And, you've taught me how we can determine upper and lower 
    bounds for the variance.  I'll present it to our customer.
    
    Any comments on the following approach? It doesn't 
    fully comply with our requirements.
    
	I don't have any ideas (yet) on how to squeeze out
more statistics out than you've outlined while staying strictly
within the bounds of the problem.  BUT, I have an idea that
may be useful if some of the "other information" associated
with each object is invariant during the object's life
and allows discrimination between classes of objects.
That is, we don't have a full object-id but we can obtain
several  "classes" of objects. 
    
EXAMPLE:
	Imagine a sample with 6 objects that can be placed 
into 3 classes (c1, c2, c3).
Mean travel times can be computed by groupings 
of classes for each link/sublink.
Group:		g1   g2   g3    g4      g5      g6        g7
Classes:	c1   c2   c3   c1+c2   c1+c3   c2+c3   c1+c2+c3
X value:	2     2    2    1       1       1         0
	x = number of classes in final group (3) - 
	    number of classes in current group
	y = absolute value of [current group mean - final group mean)]
The shape and slope of the curve is related to the variance.
(I expect the curve will asymptotically approach some value
that can be correlated to the true variance.)
If all y values are 0, the variance is (intuitively) lower than
cases where the y value increases wildly with each smaller
subset of the entire sample.
I don't know the best way to proceed mathematically.
Once could find a function that best describes the 
relationship between X and Y and extrapolate to the
x value where each class consists of a single object.
 | 
| 1710.3 | Finding error in aproximate variance | MARVA1::RAK |  | Fri Jan 15 1993 09:34 | 24 | 
|  |     Posted for a fellow employee who can't get to Notes now:
    
I need to establish the bounds of the error in approximating
sample variances.  Again, I have two sets of transactions,
one consisting of true departure times and the other consisting of
true arrival times.  But there is no way to associate the departure
transaction of an object with the arrival transaction of the
same object.
I can calculate the EXACT mean of the sample.  From the sample
mean, I want to approximate the sample variance by assuming
all objects depart at the same time.  For this, I choose the
median of the true departure times.  What is the approximation
error as a percentage of the true variance?  For concreteness,
use two hours as the range of the true departure times, five days
as the sample mean, and normal distribution for the arrival times.
A general formula will be appreciated.
A related question is how to recalculate the error when combining
samples.  Will it stay the same or get worse?
This is not a hypothetical situation.  My client's existing
equipment can't identify individual objects in a sample unless
they spend a fortune upgrading it.
 | 
| 1710.4 | First stab part I. | CADSYS::COOPER | Topher Cooper | Fri Jan 15 1993 18:16 | 81 | 
|  | RE: .3
    OK. Let's give this a shot.  Right off I should say that I don't think
    that pretending that all the departures are at the median departure is
    a very good choice of estimates unless the arival times are *much* more
    spread out than the departure times, in which case it doesn't matter
    much what model you use.
    The bottom line is that there is no firm answer to this.  To know what
    the "true" error is I would have to have a model of the transport
    process.  One such (not totally unreasonable) model is that the
    transport process acts like a FIFO queue: the objects are transported
    on a strictly first come first serve basis.  Another (probably not
    particularly reasonable) model is that the transport process acts like
    a LIFO queue: the objects are collected somewhere and then sent out in
    reverse order of arival with order strictly maintained before and after
    the storage.
    As I said before these form bounding cases: upper and lower bounds
    for the true sample variance.  There are also some other models we
    can look at.  But first lets develop some formulas which will help
    us compare (warning I'm basically doing this as I write -- it is
    thus very much subject to dumb algebraic mistakes, check carefully
    before using).
    Assume, for the outset, that we do know what arrival times go with
    which departure times for our n objects.  Arranged in no particular
    order we can call the departure time for the i'th object d[i] and its
    arrival time a[i].  The transport time t[i] = a[i] - d[i].  First
    off we want the mean transport time, T.  We find that T = A - D
    where A is the mean arrival time and D is the mean departure time.
    This does not depend on which a goes with which d so this can be
    calculated in a model independent way.
    Now we are interested in the sample variance.  Let V[t] be the variance
    of the transport times, V[a] be the variance of the arrival times
    and V[d] be the variance of the departure times.  Then
	V[t] = V[d] + V[a] - 2*{sum(a[i]*d[i]) - n*A*D}/(n-1)
    We can split this into a model independent part:
	Vi[t] = V[d] + V[a] + 2*n*A*D/(n-1)
    and a model dependent part:
	Vd[t] = 2*sum(a[i]*d[i])/(n-1)
	V[t] = Vi[t] - Vd[t]
    Let's look at a variation of the compuational method you used.  Its
    the same except we will use the mean, D, rather than the median of the
    departure times.  We can use the same formula as above except all the
    d[i]s = D, and V[d] = 0.  We get:
	V'[t] = 0 + V[a] + 2*n*A*D/(n-1)  - 2*sum(a[i]*D)/(n-1)
	      = V[a] + 2*n*A*D/(n-1) - 2*D*sum(a[i])/(n-1)
	      = V[a] + 2*n*A*D/(n-1) - 2*D*n*A/(n-1)
	      = V[a]
    This isn't too surprising result, basically if we assume that there is
    no variation in the departure time, all the variance of the arrival
    time is due to the variance of the transportation.
    If we use a constant value other than D, say D+O, then we get
	V'[t] = V[a] + 2AO/(n-1)
    we get some additional variance.  This simply reflects that we are, in
    a sense taking a variance around something other than the mean, and the
    mean (generally refered to as simply the variance) is the point which
    produces the minimal variance.  So using the median will add a factor
    but it doesn't really mean much.  It might actually improve the
    estimate, but only because it adds back in a bit which may compensate
    for ignoring the variance of d (in fact, the expected O for the median
    will be, as I remember, proportional to V[d] with a proportionality
    which depends on the distribution of d, and a bias which depends on the
    skew of the distribution of d).
    Time to go home.  More later.
					Topher
 | 
| 1710.5 | Improved formula | CADSYS::COOPER | Topher Cooper | Mon Jan 18 1993 14:36 | 61 | 
|  | RE: .4
    Driving home from work Friday, I thought of an improvement in the
    equation I posted in .4.
    Start with the same formula for the variance of the transport times:
    	V[t] = V[d] + V[a] - 2*{sum(a[i]*d[i]) - n*A*D}/(n-1)
    Define quantities aa[i] and dd[i] such that:
	a[i] = A + aa[i]
	d[i] = D + dd[i]
    That is, aa and dd are the offsets from the mean arrival and departure
    times respectively.  If we substitute these for the part of the
    equation above inside the {} we get:
	... sum(a[i]*d[i]) - n*A*D ... =
	... sum((A+aa[i])*(D+dd[i]) - n*A*D ... =
	... sum(A*D) + sum(A*dd[i]) + sum(D*aa[i]) + sum(aa[i]*dd[i]) -
		n*A*D ... =
	... n*A*D + A*sum(dd[i]) + D*sum(aa[i]) + sum(aa[i]*dd[i]) -
		n*A*D ...
    But sum(dd[i]) = sum(aa[i]) = 0, so this comes to:
	... sum(aa[i]*dd[i]) ...
    Now if we split the equation into model independent and dependent parts
    we get the simpler:
	Vi[t] = V[d] + V[a]
	Vd[t] = 2*sum(aa[i]*dd[i])/(n-1)
	V[t] = Vi[t] - Vd[t]
    This form makes it easier to see mathematically, why the FIFO and LIFO
    cases are lower and upper bounds in the variance.  Think of each offset
    (aa[i] and dd[i]) as cosisting of a sign (+/-) and a magnitude.  Take
    the magnitudes of the aa's as weights to determine how much to "count"
    the value of each dd magnitude in the sum.  Both FIFO and LIFO put the
    maximum weights on the large magnitude values.  The big values get
    counted the most, while the small weights are put on the small values.
    You get the same effect, of course, if you make dd the weights and
    aa the values to be weighted.
    In the FIFO case, you get the signs matching: + with + and - with -.
    (If the number of -'s aren't the same, the mismatch, where + goes with
    -, will be in the "middle" which is where the weights are low and
    there is therefore little effect).  The products will therefore all
    be positive (except possibly on the small values, as I said) and the
    sum will be positive.  Since the sum is subtracted from the model
    indendent part, this will produce a minimum possible value.
    In the LIFO case, the signs will criss-cross: + with - and - with +.
    The products will all (except, possibally for some small values) be
    negative and the sum will be negative.  This negative sum will be
    subtracted -- which means we are "really" adding -- and so a maximum
    variance will be produced.
					Topher
 | 
| 1710.6 | First In Random Out | CADSYS::COOPER | Topher Cooper | Mon Jan 18 1993 16:20 | 60 | 
|  |     Now I'm going to discuss another possible model for transport.
    The assumption made in this model is that the transport mechanism
    essentially ignores the time or order in which the departures are
    made.  This might be the case if, for example, the objects sent out
    in any particular observation window are stored and sent out all
    at once.
    Another way of viewing this particular model is that it is an "average"
    model: it represents the "average" behavior of all the different
    possible models, weighing each one equally.
    The expected variance is the average variance for each possible way
    that input objects might be associated with output objects.  There
    are n! such combinations.  In finding the average of these variances,
    we add up the V[t]'s for each model and divide by n!.  The model
    independent part is going to be the same each time, so it will be
    added n! times then the result will be divided by n! so we get:
	V[t] = Vi[t] - sum-over-models(Vd[t])/n!
    Let dd[j,k] be the dd which matches aa[j] in model k, the second part
    of that becomes:
	 n!      n
	sum ( 2*sum (aa[j]*dd[j,k]) /(n-1) ) / n! =
	k=1     j=1
	    2	    n!    n
	-------- * sum ( sum (aa[j]*dd[j,k]) ) =
	(n-1)*n!   k=1   j=1
	    2	    n     n!
	-------- * sum ( sum (aa[j]*dd[j,k]) ) =
	(n-1)*n!   j=1   k=1
	    2	    n             n!
	-------- * sum ( aa[j] * sum (dd[j,k]) )
	(n-1)*n!   j=1           k=1
    In the inner sum of this last formula, each of the n dd's appear an
    equal ( (n-1)! ) times, so this becomes:
	    2	    n                      n
	-------- * sum ( aa[j] * (n-1)! * sum (dd[i]) )
	(n-1)*n!   j=1                    i=1
    But that new inner sum is equal to zero, so that full product is zero
    and the sum is zero, so we end up with:
	V[t] = V[d] + V[a]
    So, in the average, the model dependent parts cancel out and we are
    left with the sum of the variances of the arival times and departure
    times.
    There is one more model I can think of to go -- and it may be the most
    useful as a point estimate of the variance.
				Topher
 | 
| 1710.7 | Final Model. | CADSYS::COOPER | Topher Cooper | Sun Jan 24 1993 11:37 | 113 | 
|  |     Sorry it took me so long to get to the last part of this -- I got a bit
    backed up.
    I think that this model is your best shot at a single "point-estimate"
    for the variance.  In this case I am going to approach the problem a
    little differently -- making different kinds of assumptions.
    First off, assume that each departure occurs independently of the
    others during the departure period.  That is, for each object, there is
    a particular probability that it will be transmitted during any
    particular sub-segment of the departure period.  The function which
    determines the probability for each moment (very short sub-segment) of
    the departure period is the "distribution" of departure times. We are
    assuming that the departure distribution is the same for all the
    objects and that it is not influenced by how many, or which, or when
    any other objects were transmitted.  The variance is a characteristic
    of the distribution.  When you calculate the variance of a bunch of
    departure times what you are doing is estimating the true variance
    (generally called the "population" variance statistical jargon).
    For purposes of algebraic manipulation the departure time for an object
    is assigned a variable.  This is a special variable that does not
    represent the single, unknown time of departure for some particular
    object.  Rather it is a "random variable" which contains information
    about the entire distribution of the objects' departures.  The
    non-random variables for the departures are what I previously called
    "d[i]" for the i'th object.  To distinguish I'll call the *random*
    variable "D[i]".
    The second assumption is that the transportation time (random variable
    T[i]) for an object is independent of the departure time for that
    object and independent unaffected by the transmissions of all the other
    objects.  That is probably not "realistic" -- it means that there is
    no waiting at any point in the transportation while the mechanism
    "finishes" with another object, no rerouting depending on traffic,
    etc.  But I think that it may well be a not too unreasonable
    approximation, in the sense that the inaccuracies are small in terms of
    behavior and probably cancel each other out over time.
    Let's look at the behavior of a single object with single (though
    unspecified) departure and transportation times.  The time of the
    arrival will be the time of the departure plus the time of the
    transmission:
	    a[i] = d[i] + t[i]
    (Note that I am not using random variables in the above).  That means
    that we can say the same about the random variables:
	    A[i] = D[i] + T[i]
    The addition of random variables produces a random variable as well,
    but the process is not a simple addition of single values.  The
    probability for a time in the "middle" of the distribution associated
    with A[i] will be due to many combinations like an early departure but
    a long transmission time, or vice versa.  (In fact, one of the most
    important theorems in statistics says that if you add up enough of
    almost *any* random variables the result will have a distribution very
    close the the normal distribution -- this is one of the reasons that
    the normal distribution is used so much -- but that doesn't help us
    with just adding to random variables unless we know that they are both
    very close to the normal distribution).
    What we want to know, however, is the variances of the random
    variables.  As it happens, though getting the distribution of A[i] from
    D[i] and T[i] is not simple, getting the variance is straight forward.
    Regardless of the distributions involved (except that they must have
    variances, which is not, in theory always the case for "infinite"
    distributions -- which we need not worry about),
	The variance of a sum of random variables is equal to the sum
	of their variances.
    So:
	    V[A] = V[D] + V[T]
    (this is one good reason to work with variances rather than standard
    deviations).  This also bears on one of your questions:
>A related question is how to recalculate the error when combining
>samples.  Will it stay the same or get worse?
    We don't need to know V[A], though, we can estimate it, and V[D] from
    our observables.  What we want is V[T].  So:
	    V[T] = V[A] - V[D]
    And that is the result we are looking for.
    It is interesting to note, by our previous formula that this means that
    on the average, under these assumptions that:
	    sum(aa[i]*dd[i])/(n-1) = V[D]
    Which is a surprising and unexpected result.  Kinda' neat.
    One more note:
    You may find that if you look at enough samples, you will find that
	               2
	    V[D] = W[D] /12
    Where W[D] is the "width" of the departure period, i.e., the amount of
    time during which transmissions are made.  That probably means that
    the departure times are pretty close to being distributed uniformly,
    i.e., equally likely to occur any time during the transmission period.
    In that case, you might find it more convenient (though a tad less
    accurate, I think) to just use the above formula, instead of the
    measured V[D].
				    Topher
 | 
| 1710.8 | Proof of previous result. | CADSYS::COOPER | Topher Cooper | Mon Jan 25 1993 15:29 | 87 | 
|  | RE: .7 (me)
    In the last note I found that under the assumptions of the model
    described there that:
>    It is interesting to note, by our previous formula that this means that
>    on the average, under these assumptions that:
>
>	    sum(aa[i]*dd[i])/(n-1) = V[D]
    This was found by comparing the general formula for variance of
    transport times, in terms of matching up departures and arrivals, with
    the result of the transport variance from the addition of variances
    formula.  While this result seemed surprising enough to me for me to
    describe it as "kinda' neat", it was also surprising enough to me to
    worry that I had made a mistake somewhere.  So here is a derivation
    of the above relationship directly in terms of our "matching up" view
    of things.
    Start with the formula from .7.
	    a[i] = d[i] + t[i]
    Since means are linear (the mean of a sum equals the sum of the means),
    we also have:
	    A = D + T
    where A, D and T are the mean arrival, departure and transport time.
    Also remember that:
	    a[i] = A + aa[i]
	    d[i] = D + dd[i]
    and by extension:
	    t[i] = T + tt[i]
    So the first formula above becomes, by substitution:
	    A + aa[i] = D + dd[i] + T + tt[i] ->
	    D + T + aa[i] = D + dd[i] + T + tt[i] ->
	    aa[i] = dd[i] + tt[i]
    so:
    
	    sum( aa[i] * dd[i] )/(n-1) =
	    sum( (dd[i]+tt[i])*dd[i] )/(n-1) =
	    (sum(dd[i]�) + sum(dd[i]*tt[i]))/(n-1) =
	    sum(dd[i]�)/(n-1) + sum(dd[i]*tt[i])/(n-1)
    But the expected value of the second summation in that last line is 0
    (which I'll demonstrate in a moment).  Accepting that for the moment, I
    can continue:
	  = sum(dd[i]�)/(n-1) =
	    sum( (d[i] - D)� ) / (n-1) = (by definition)
	    V[D]
    So why is the second summation 0?  The fast answer I can give is that
    dd[i] and tt[i] are independent of each other and in probability theory
    that the expected value (average value) of such a cross product is 0
    is the definition of independence.  But that is really cheating, since
    that definition is rather arbitrary until you see why the intuitive
    concept of independence would lead to that mathematical property.
    For simplicity, lets assume that there are a finite number of possible
    dd and a finite number of possible tt values.  If I take a large number
    of pairs and just look at those pairs which are associated with any
    particular dd value (say dd[(j)]) the frequency of each of the tt
    values is the same as for any other dd.  This is because the
    probability that for any particular pair a particular tt comes up does
    not depend (i.e., is independent) on which dd is in the pair.  So if we
    add up a large number of pairs, and group the sum by dd[(j)]'s each
    dd[(j)] will be multiplied by a sum of tt[(k)]'s.  But the mean value
    of the tt[(k)]'s is zero.  So on the average, each tt is multiplied by
    zero, so the expectated value of:
	    dd[i]*tt[i]
    is zero, and a sum of such values is zero, so:
	    E[sum(dd[i]*tt[i])/(n-1)] = 0
    where E[xxx] is the expected (average) value of xxx.
				    Topher
 |