[Search for users] [Overall Top Noters] [List of all Conferences] [Download this site]

Conference rusure::math

Title:	Mathematics at DEC

Moderator:	RUSURE::EDP

Created:	Mon Feb 03 1986
Last Modified:	Fri Jun 06 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	2083
Total number of notes:	14613

1710.0. "Needed: Slick Statistic Trick" by MARVA2::RAK () Wed Jan 13 1993 11:42

Our customer would like to obtain standard deviations,
but the raw data format doesn't allow it.
Are their any "slick tricks" that would allow us to derive
"approximate" standard deviations in the following problem?

Thanks for any insights, 
Dave Rak MARVA1::RAK  or  DTN 425-3124

First, a definition:
Sample - collection of transactions concerning the movement of
	objects over a single link where they all depart from
	the start of the link within the same interval.

1 INTRODUCTION
The application tracks mail through a network. Existing mail
processing hardware is incapable of identifying individual mail
objects in samples collected for measurement purposes. 

2 AN EXAMPLE
Suppose we collect transactions about a sample of objects
traveling from A through B and C to D.
A, B, C, and D are nodes on the mail network.  
AD is the link.  AB, AC, BC, BD, and CD are sublinks. 

There are n objects in the sample. Each transaction contains a
sample ID, a timestamp, and other information--but no object ID. 
In fact, all these transactions carry the same sample ID (such as
a modified zip code) and there's no way to associate the
transactions belonging to a single object. 

3 WHAT CAN BE DONE
We can calculate the mean travel time for the link and each of
the sublinks.  Take BC for example. First we convert all B and C
time stamps into relative elapsed times. Then we subtract the sum
of all B times from the sum of all C times and divide the
difference by n (the number of objects) to get the mean.  If the
mean is positive, we consider it valid. 

We get many such means as time goes on. For any one link or
sublinks, its means will be normally distributed. If samples vary
in size, we can generate weighted means. But means are the ONLY
statistics we can get out of the original transactions. 

4 WHAT CAN'T BE DONE
We can't calculate the sample variance. As a result, we can't
answer questions involving distribution. 

With the assumption that all objects in the sample start out at
the same time by using a small sample interval we can approximate
variances and histograms for the link and those sublinks which
begin with the first node (in our example, A).  These
approximations may or may not be acceptable. But we can't force
the same assumption on the other sublinks since the error is
compounded at each node after the first. 

5 ALTERNATIVES
   1. Upgrade the mail processing hardware to make it possible to
      identify individual objects - expensive :(
   2. Run a contest to find the best software solution.

T.R	Title	User	Personal Name	Date	Lines
1710.1	My shot.	CADSYS::COOPER	Topher Cooper	`Wed Jan 13 1993 15:46`	40
	Well I'm not sure I understand the problem, but I'll take a shot. If you can assume that some number of samples are drawn from the same distribution (i.e., same mean and variance) then you can calculate the variance of each sample from the variance of the means around their common mean. I suspect that you are looking for the variance of each sample, however, and there really isn't enough information for this. If I understand what you are telling us, you have a bunch of transmission times and a bunch of reception times but no way of identifying corresponding times. Now imagine the following sample: You have transmission times of 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10. You have reception times of 60, 61, 62, 63, 64, 65, 66, 67, 68, 69 and 70. The mean time is 60. Now, if the channel is FIFO, so that the 0 tranmission corresponds to the 60 reception, the 1 transmission corresponds to the 61 reception, and so on. Each transmission takes exactly 60 (I'm ignoring rounding effects) so the variance is 0. But, if the channel is LIFO, so the correspondences are 0:70, 1:69, 2:68, etc. then the variance works out to 44. Which is considerably different from the FIFO case. Since the data really is ambiguous no computational trick will allow us to pull an unambiguous variance out of the hat. These two cases do, however, provide upper and lower bounds which might or might not be good enough for your purposes. The physical situation is such that the lower bound is likely to be much closer to the true value than the upper bound. Perhaps you have enough knowledge of the process (e.g., an estimate of how often objects cross in transmission. Then we could do better. What have I totally misunderstood? Topher
1710.2	Bullseye !	MARVA2::RAK		`Wed Jan 13 1993 17:31`	43
	You understood the problem exactly ! And, you've taught me how we can determine upper and lower bounds for the variance. I'll present it to our customer. Any comments on the following approach? It doesn't fully comply with our requirements. I don't have any ideas (yet) on how to squeeze out more statistics out than you've outlined while staying strictly within the bounds of the problem. BUT, I have an idea that may be useful if some of the "other information" associated with each object is invariant during the object's life and allows discrimination between classes of objects. That is, we don't have a full object-id but we can obtain several "classes" of objects. EXAMPLE: Imagine a sample with 6 objects that can be placed into 3 classes (c1, c2, c3). Mean travel times can be computed by groupings of classes for each link/sublink. Group: g1 g2 g3 g4 g5 g6 g7 Classes: c1 c2 c3 c1+c2 c1+c3 c2+c3 c1+c2+c3 X value: 2 2 2 1 1 1 0 x = number of classes in final group (3) - number of classes in current group y = absolute value of [current group mean - final group mean)] The shape and slope of the curve is related to the variance. (I expect the curve will asymptotically approach some value that can be correlated to the true variance.) If all y values are 0, the variance is (intuitively) lower than cases where the y value increases wildly with each smaller subset of the entire sample. I don't know the best way to proceed mathematically. Once could find a function that best describes the relationship between X and Y and extrapolate to the x value where each class consists of a single object.
1710.3	Finding error in aproximate variance	MARVA1::RAK		`Fri Jan 15 1993 09:34`	24
	Posted for a fellow employee who can't get to Notes now: I need to establish the bounds of the error in approximating sample variances. Again, I have two sets of transactions, one consisting of true departure times and the other consisting of true arrival times. But there is no way to associate the departure transaction of an object with the arrival transaction of the same object. I can calculate the EXACT mean of the sample. From the sample mean, I want to approximate the sample variance by assuming all objects depart at the same time. For this, I choose the median of the true departure times. What is the approximation error as a percentage of the true variance? For concreteness, use two hours as the range of the true departure times, five days as the sample mean, and normal distribution for the arrival times. A general formula will be appreciated. A related question is how to recalculate the error when combining samples. Will it stay the same or get worse? This is not a hypothetical situation. My client's existing equipment can't identify individual objects in a sample unless they spend a fortune upgrading it.
1710.4	First stab part I.	CADSYS::COOPER	Topher Cooper	`Fri Jan 15 1993 18:16`	81
	RE: .3 OK. Let's give this a shot. Right off I should say that I don't think that pretending that all the departures are at the median departure is a very good choice of estimates unless the arival times are much more spread out than the departure times, in which case it doesn't matter much what model you use. The bottom line is that there is no firm answer to this. To know what the "true" error is I would have to have a model of the transport process. One such (not totally unreasonable) model is that the transport process acts like a FIFO queue: the objects are transported on a strictly first come first serve basis. Another (probably not particularly reasonable) model is that the transport process acts like a LIFO queue: the objects are collected somewhere and then sent out in reverse order of arival with order strictly maintained before and after the storage. As I said before these form bounding cases: upper and lower bounds for the true sample variance. There are also some other models we can look at. But first lets develop some formulas which will help us compare (warning I'm basically doing this as I write -- it is thus very much subject to dumb algebraic mistakes, check carefully before using). Assume, for the outset, that we do know what arrival times go with which departure times for our n objects. Arranged in no particular order we can call the departure time for the i'th object d[i] and its arrival time a[i]. The transport time t[i] = a[i] - d[i]. First off we want the mean transport time, T. We find that T = A - D where A is the mean arrival time and D is the mean departure time. This does not depend on which a goes with which d so this can be calculated in a model independent way. Now we are interested in the sample variance. Let V[t] be the variance of the transport times, V[a] be the variance of the arrival times and V[d] be the variance of the departure times. Then V[t] = V[d] + V[a] - 2{sum(a[i]d[i]) - nAD}/(n-1) We can split this into a model independent part: Vi[t] = V[d] + V[a] + 2nAD/(n-1) and a model dependent part: Vd[t] = 2sum(a[i]d[i])/(n-1) V[t] = Vi[t] - Vd[t] Let's look at a variation of the compuational method you used. Its the same except we will use the mean, D, rather than the median of the departure times. We can use the same formula as above except all the d[i]s = D, and V[d] = 0. We get: V'[t] = 0 + V[a] + 2nAD/(n-1) - 2sum(a[i]D)/(n-1) = V[a] + 2nAD/(n-1) - 2Dsum(a[i])/(n-1) = V[a] + 2nAD/(n-1) - 2Dn*A/(n-1) = V[a] This isn't too surprising result, basically if we assume that there is no variation in the departure time, all the variance of the arrival time is due to the variance of the transportation. If we use a constant value other than D, say D+O, then we get V'[t] = V[a] + 2AO/(n-1) we get some additional variance. This simply reflects that we are, in a sense taking a variance around something other than the mean, and the mean (generally refered to as simply the variance) is the point which produces the minimal variance. So using the median will add a factor but it doesn't really mean much. It might actually improve the estimate, but only because it adds back in a bit which may compensate for ignoring the variance of d (in fact, the expected O for the median will be, as I remember, proportional to V[d] with a proportionality which depends on the distribution of d, and a bias which depends on the skew of the distribution of d). Time to go home. More later. Topher
1710.5	Improved formula	CADSYS::COOPER	Topher Cooper	`Mon Jan 18 1993 14:36`	61
	RE: .4 Driving home from work Friday, I thought of an improvement in the equation I posted in .4. Start with the same formula for the variance of the transport times: V[t] = V[d] + V[a] - 2{sum(a[i]d[i]) - nAD}/(n-1) Define quantities aa[i] and dd[i] such that: a[i] = A + aa[i] d[i] = D + dd[i] That is, aa and dd are the offsets from the mean arrival and departure times respectively. If we substitute these for the part of the equation above inside the {} we get: ... sum(a[i]d[i]) - nAD ... = ... sum((A+aa[i])(D+dd[i]) - nAD ... = ... sum(AD) + sum(Add[i]) + sum(Daa[i]) + sum(aa[i]dd[i]) - nAD ... = ... nAD + Asum(dd[i]) + Dsum(aa[i]) + sum(aa[i]dd[i]) - nAD ... But sum(dd[i]) = sum(aa[i]) = 0, so this comes to: ... sum(aa[i]dd[i]) ... Now if we split the equation into model independent and dependent parts we get the simpler: Vi[t] = V[d] + V[a] Vd[t] = 2sum(aa[i]dd[i])/(n-1) V[t] = Vi[t] - Vd[t] This form makes it easier to see mathematically, why the FIFO and LIFO cases are lower and upper bounds in the variance. Think of each offset (aa[i] and dd[i]) as cosisting of a sign (+/-) and a magnitude. Take the magnitudes of the aa's as weights to determine how much to "count" the value of each dd magnitude in the sum. Both FIFO and LIFO put the maximum weights on the large magnitude values. The big values get counted the most, while the small weights are put on the small values. You get the same effect, of course, if you make dd the weights and aa the values to be weighted. In the FIFO case, you get the signs matching: + with + and - with -. (If the number of -'s aren't the same, the mismatch, where + goes with -, will be in the "middle" which is where the weights are low and there is therefore little effect). The products will therefore all be positive (except possibly on the small values, as I said) and the sum will be positive. Since the sum is subtracted from the model indendent part, this will produce a minimum possible value. In the LIFO case, the signs will criss-cross: + with - and - with +. The products will all (except, possibally for some small values) be negative and the sum will be negative. This negative sum will be subtracted -- which means we are "really" adding -- and so a maximum variance will be produced. Topher
1710.6	First In Random Out	CADSYS::COOPER	Topher Cooper	`Mon Jan 18 1993 16:20`	60
	Now I'm going to discuss another possible model for transport. The assumption made in this model is that the transport mechanism essentially ignores the time or order in which the departures are made. This might be the case if, for example, the objects sent out in any particular observation window are stored and sent out all at once. Another way of viewing this particular model is that it is an "average" model: it represents the "average" behavior of all the different possible models, weighing each one equally. The expected variance is the average variance for each possible way that input objects might be associated with output objects. There are n! such combinations. In finding the average of these variances, we add up the V[t]'s for each model and divide by n!. The model independent part is going to be the same each time, so it will be added n! times then the result will be divided by n! so we get: V[t] = Vi[t] - sum-over-models(Vd[t])/n! Let dd[j,k] be the dd which matches aa[j] in model k, the second part of that becomes: n! n sum ( 2sum (aa[j]dd[j,k]) /(n-1) ) / n! = k=1 j=1 2 n! n -------- * sum ( sum (aa[j]dd[j,k]) ) = (n-1)n! k=1 j=1 2 n n! -------- * sum ( sum (aa[j]dd[j,k]) ) = (n-1)n! j=1 k=1 2 n n! -------- * sum ( aa[j] * sum (dd[j,k]) ) (n-1)n! j=1 k=1 In the inner sum of this last formula, each of the n dd's appear an equal ( (n-1)! ) times, so this becomes: 2 n n -------- sum ( aa[j] * (n-1)! * sum (dd[i]) ) (n-1)*n! j=1 i=1 But that new inner sum is equal to zero, so that full product is zero and the sum is zero, so we end up with: V[t] = V[d] + V[a] So, in the average, the model dependent parts cancel out and we are left with the sum of the variances of the arival times and departure times. There is one more model I can think of to go -- and it may be the most useful as a point estimate of the variance. Topher
1710.7	Final Model.	CADSYS::COOPER	Topher Cooper	`Sun Jan 24 1993 11:37`	113
	Sorry it took me so long to get to the last part of this -- I got a bit backed up. I think that this model is your best shot at a single "point-estimate" for the variance. In this case I am going to approach the problem a little differently -- making different kinds of assumptions. First off, assume that each departure occurs independently of the others during the departure period. That is, for each object, there is a particular probability that it will be transmitted during any particular sub-segment of the departure period. The function which determines the probability for each moment (very short sub-segment) of the departure period is the "distribution" of departure times. We are assuming that the departure distribution is the same for all the objects and that it is not influenced by how many, or which, or when any other objects were transmitted. The variance is a characteristic of the distribution. When you calculate the variance of a bunch of departure times what you are doing is estimating the true variance (generally called the "population" variance statistical jargon). For purposes of algebraic manipulation the departure time for an object is assigned a variable. This is a special variable that does not represent the single, unknown time of departure for some particular object. Rather it is a "random variable" which contains information about the entire distribution of the objects' departures. The non-random variables for the departures are what I previously called "d[i]" for the i'th object. To distinguish I'll call the random variable "D[i]". The second assumption is that the transportation time (random variable T[i]) for an object is independent of the departure time for that object and independent unaffected by the transmissions of all the other objects. That is probably not "realistic" -- it means that there is no waiting at any point in the transportation while the mechanism "finishes" with another object, no rerouting depending on traffic, etc. But I think that it may well be a not too unreasonable approximation, in the sense that the inaccuracies are small in terms of behavior and probably cancel each other out over time. Let's look at the behavior of a single object with single (though unspecified) departure and transportation times. The time of the arrival will be the time of the departure plus the time of the transmission: a[i] = d[i] + t[i] (Note that I am not using random variables in the above). That means that we can say the same about the random variables: A[i] = D[i] + T[i] The addition of random variables produces a random variable as well, but the process is not a simple addition of single values. The probability for a time in the "middle" of the distribution associated with A[i] will be due to many combinations like an early departure but a long transmission time, or vice versa. (In fact, one of the most important theorems in statistics says that if you add up enough of almost any random variables the result will have a distribution very close the the normal distribution -- this is one of the reasons that the normal distribution is used so much -- but that doesn't help us with just adding to random variables unless we know that they are both very close to the normal distribution). What we want to know, however, is the variances of the random variables. As it happens, though getting the distribution of A[i] from D[i] and T[i] is not simple, getting the variance is straight forward. Regardless of the distributions involved (except that they must have variances, which is not, in theory always the case for "infinite" distributions -- which we need not worry about), The variance of a sum of random variables is equal to the sum of their variances. So: V[A] = V[D] + V[T] (this is one good reason to work with variances rather than standard deviations). This also bears on one of your questions: >A related question is how to recalculate the error when combining >samples. Will it stay the same or get worse? We don't need to know V[A], though, we can estimate it, and V[D] from our observables. What we want is V[T]. So: V[T] = V[A] - V[D] And that is the result we are looking for. It is interesting to note, by our previous formula that this means that on the average, under these assumptions that: sum(aa[i]*dd[i])/(n-1) = V[D] Which is a surprising and unexpected result. Kinda' neat. One more note: You may find that if you look at enough samples, you will find that 2 V[D] = W[D] /12 Where W[D] is the "width" of the departure period, i.e., the amount of time during which transmissions are made. That probably means that the departure times are pretty close to being distributed uniformly, i.e., equally likely to occur any time during the transmission period. In that case, you might find it more convenient (though a tad less accurate, I think) to just use the above formula, instead of the measured V[D]. Topher
1710.8	Proof of previous result.	CADSYS::COOPER	Topher Cooper	`Mon Jan 25 1993 15:29`	87
	RE: .7 (me) In the last note I found that under the assumptions of the model described there that: > It is interesting to note, by our previous formula that this means that > on the average, under these assumptions that: > > sum(aa[i]dd[i])/(n-1) = V[D] This was found by comparing the general formula for variance of transport times, in terms of matching up departures and arrivals, with the result of the transport variance from the addition of variances formula. While this result seemed surprising enough to me for me to describe it as "kinda' neat", it was also surprising enough to me to worry that I had made a mistake somewhere. So here is a derivation of the above relationship directly in terms of our "matching up" view of things. Start with the formula from .7. a[i] = d[i] + t[i] Since means are linear (the mean of a sum equals the sum of the means), we also have: A = D + T where A, D and T are the mean arrival, departure and transport time. Also remember that: a[i] = A + aa[i] d[i] = D + dd[i] and by extension: t[i] = T + tt[i] So the first formula above becomes, by substitution: A + aa[i] = D + dd[i] + T + tt[i] -> D + T + aa[i] = D + dd[i] + T + tt[i] -> aa[i] = dd[i] + tt[i] so: sum( aa[i] dd[i] )/(n-1) = sum( (dd[i]+tt[i])dd[i] )/(n-1) = (sum(dd[i]�) + sum(dd[i]tt[i]))/(n-1) = sum(dd[i]�)/(n-1) + sum(dd[i]tt[i])/(n-1) But the expected value of the second summation in that last line is 0 (which I'll demonstrate in a moment). Accepting that for the moment, I can continue: = sum(dd[i]�)/(n-1) = sum( (d[i] - D)� ) / (n-1) = (by definition) V[D] So why is the second summation 0? The fast answer I can give is that dd[i] and tt[i] are independent of each other and in probability theory that the expected value (average value) of such a cross product is 0 is the definition of independence. But that is really cheating, since that definition is rather arbitrary until you see why the intuitive concept of independence would lead to that mathematical property. For simplicity, lets assume that there are a finite number of possible dd and a finite number of possible tt values. If I take a large number of pairs and just look at those pairs which are associated with any particular dd value (say dd[(j)]) the frequency of each of the tt values is the same as for any other dd. This is because the probability that for any particular pair a particular tt comes up does not depend (i.e., is independent) on which dd is in the pair. So if we add up a large number of pairs, and group the sum by dd[(j)]'s each dd[(j)] will be multiplied by a sum of tt[(k)]'s. But the mean value of the tt[(k)]'s is zero. So on the average, each tt is multiplied by zero, so the expectated value of: dd[i]tt[i] is zero, and a sum of such values is zero, so: E[sum(dd[i]*tt[i])/(n-1)] = 0 where E[xxx] is the expected (average) value of xxx. Topher