T.R | Title | User | Personal Name | Date | Lines |
--------------------------------------------------------------------------------
1990.1 | re. .0: Clarification | EVTSG8::ESANU | Au temps pour moi | Tue Aug 29 1995 13:59 | 13 |
| > Using the hypergeometric distribution, the MLE (= Maximum
> Likelihood Estimate) for the total number of errors in
> the program, N , can be shown as:
>
>    ^          n(1) * n(2)
>    N = floor  -----------
>                 k(1,2)
"Total number of errors in the program" means really total, including
unknown errors, namely errors not discovered by any of the testing teams.
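A minimal sketch of the estimator in Python (the counts in the example are
made up):

    from math import floor

    def mle_total_errors(n1, n2, k12):
        """MLE from .0: floor(n1 * n2 / k12)."""
        if k12 == 0:
            raise ValueError("no common errors: the estimate is unbounded")
        return floor(n1 * n2 / k12)

    # Example: team 1 finds 30 errors, team 2 finds 40, 12 in common.
    print(mle_total_errors(30, 40, 12))   # -> 100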
/Mihai.
|
1990.2 | remind me, please... | JOBURG::BUCHANAN | | Mon Sep 11 1995 18:42 | 7 |
| All my books are in storage at the other end of the accessible
universe. Could you remind me of the hypergeometric distribution, so
that I can check that we are working from the same assumptions about
the problem (e.g. all bug detections equiprobable and independent)?
Cheers,
Andrew.
|
1990.3 | Definition of the hypergeometric distribution | EVTSG8::ESANU | Au temps pour moi | Tue Sep 12 1995 10:53 | 34 |
| In a set A, consider a partition into two subsets A1 and A2 , so that:

        A = A1 ∪ A2
        A1 ∩ A2 = ∅

Let

        card A  = n
        card A1 = n1
        card A2 = n2
A group of r elements is chosen at random. Let q(k) be the probability
that the group so chosen will contain exactly k elements from the set A1
( k <= min {n1,r} ).
q(k) is shown to be
               C(n1, k) * C(n - n1, r - k)
        q(k) = ---------------------------
                         C(n, r)

where C(a, b) denotes the number of combinations of a elements taken b at a
time (the binomial coefficient).
The system of probabilities so defined is called "the hypergeometric
distribution".
The probabilities q(k) are defined only for k <= min {n1, r} , but since
C(a, b) = 0 whenever b > a, the formula for q(k) above gives q(k) = 0
if either k > n1 or k > r . So the formula can be employed for all
k >= 0, provided q(k) = 0 is interpreted as impossibility.
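A direct transcription into Python; math.comb already follows the convention
C(a, b) = 0 whenever b > a:

    from math import comb

    def q(k, n, n1, r):
        """Hypergeometric probability of exactly k elements from A1.
        Valid for 0 <= k <= r (math.comb raises on negative arguments);
        comb(n1, k) is 0 when k > n1, giving q(k) = 0 as above."""
        return comb(n1, k) * comb(n - n1, r - k) / comb(n, r)

    # Example with made-up sizes: n = 50, n1 = 20, r = 10.
    print(q(4, 50, 20, 10))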
You must feel lonely without your books, Andrew! I'd be lost without mine.
Cheers,
Mihai.
|
1990.4 | | WIBBIN::NOYCE | EV5 issues 4 instructions per meter | Tue Sep 12 1995 12:10 | 29 |
| It looks like the formula in .0 says:

        Divide the set of all bugs into two sets, those found by group 1
        and those not found by group 1. Group 2 finds some fraction of
        the bugs found by group 1; assume they find the same fraction of
        the set of all bugs.

This gives an estimate consistent with the hypergeometric distribution.
It's also nice in that it's symmetric -- it doesn't matter which group
is called group 1.
One problem is that it assumes all bugs are equally likely to be found
-- not very typical!
To add a third group, you could simply compute three pairwise estimates
and average them somehow. Or you could lump the bugs found by groups
1 and 2 into a single group, and apply the formula in .0 to that group
and group 3 -- does doing this all three ways come up with different
results? These approaches are all based on the original assumption.
But it seems more practical to use extra groups to challenge the assumption
that all bugs are equally likely. Perhaps throw out any bugs found by
all three groups, and apply .0's formula pairwise to the rest (this shrinks
the intersection and so increases the estimate), and finally add in the
common bugs. If each group finds the same set of bugs, it seems plausible
that there are no more; if each group finds some unique bugs but the only
intersection is the bugs found by all three, it seems plausible that there
are a whole lot more left to be found (the formula reduces to 1/0).
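A sketch of these pairwise ideas in Python (the bug sets and the simple
averaging rule are invented for illustration):

    def pairwise_estimate(found_a, found_b):
        """The .0 formula applied to two teams' bug sets."""
        common = len(found_a & found_b)
        if common == 0:
            return float("inf")      # no overlap: the estimate blows up
        return len(found_a) * len(found_b) // common

    def three_team_average(f1, f2, f3):
        """One arbitrary way to combine the three pairwise estimates."""
        pairs = [(f1, f2), (f1, f3), (f2, f3)]
        return sum(pairwise_estimate(a, b) for a, b in pairs) / 3

    def lumped_estimate(f1, f2, f3):
        """Lump groups 1 and 2 into one team, then apply .0 against group 3."""
        return pairwise_estimate(f1 | f2, f3)

    f1, f2, f3 = {1, 2, 3, 4}, {3, 4, 5, 6}, {1, 4, 6, 7}
    print(three_team_average(f1, f2, f3), lumped_estimate(f1, f2, f3))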
|
1990.5 | probability of bug finding | CSSE::NEILSEN | Wally Neilsen-Steinhardt | Wed Sep 13 1995 13:37 | 24 |
| .4> But it seems more practical to use extra groups to challenge the assumption
> that all bugs are equally likely.
I would agree, but I would say "test the hypothesis".

I thought about this problem years ago. I can formulate it, but the math is
beyond me. Maybe somebody here can work through to an analytic solution, or
maybe it is worth the trouble to simulate it.
Assume that bugs are not all equally likely to be found, that is, the
probability of detecting a bug varies with the bug. Number the bugs in order
of decreasing probability, and assume a simple form of variation:

        P(n,i) = Ai * exp( - Bi * n )    for n = 1 to nmax

where P(n,i) is the probability of finding bug n during test i, and Ai and Bi
are constants depending on test i. The hypothesis that all bugs are equally
likely corresponds to Bi = 0.

The problem is to calculate Ai, Bi and nmax with confidence intervals (or,
equivalently, the posterior distributions of these variables) given the number
of bugs found in the various tests and the number of identical bugs found in
multiple tests.
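A quick simulation sketch of this model (all the constants are invented),
showing how the estimator from .0 behaves when Bi > 0:

    import random
    from math import exp

    def simulate_test(nmax, A, B, seed=None):
        """One test: bug n is found with probability A * exp(-B * n)."""
        rng = random.Random(seed)
        return {n for n in range(1, nmax + 1) if rng.random() < A * exp(-B * n)}

    # Two tests over the same 200 hypothetical bugs, then the .0 estimate.
    t1 = simulate_test(200, 0.6, 0.01, seed=1)
    t2 = simulate_test(200, 0.5, 0.02, seed=2)
    k = len(t1 & t2)
    print(len(t1), len(t2), k, len(t1) * len(t2) // k if k else "inf")

With Bi > 0 the overlap is concentrated in the easy bugs, so the estimate
tends to come out below the true 200.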
|
1990.6 | some musings - no answers | AUSSIE::GARSON | achtentachtig kacheltjes | Wed Sep 13 1995 20:19 | 56 |
| The quintessential example of the hypergeometric distribution is fish
population estimation.
Suppose we wish to estimate the total number of fish in some closed
environment.
Let N be the (unknown) total number of fish. Capture alive a certain number of
fish m, tag them, let's say with a blue tag, and release them. We then capture
a further n fish and count the number of these, X, that are tagged.
P(X=x) = C(m,x)C(N-m,n-x)/C(N,n)
X is a random variable and I will write that X ~ HG(N,m,n).
It can be shown that

        mean[X] = nm/N,  i.e.  mean[X] = np          where p = m/N

and that

        variance[X] = npq (N-n)/(N-1)                where q = 1 - p
If N >> n then the hypergeometric distribution can be approximated by the
binomial distribution. My Statistics 1H[*] notes, which I dredged out last
night, suggest that this approximation is reasonable for N >= 10n.
Unfortunately, in the context of the base note that corresponds to a bug find
rate of not more than 10%, which I wouldn't consider either desirable or
satisfactory. It is of course quite reasonable in sampling fish populations.
In any case we can use the formula for the mean to derive an estimator for N,
normally written as N with a ^ above it but I'll write it here as N'.
N' = nm/X
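A small sketch of the experiment (population and capture sizes made up),
comparing N' with the true N:

    import random

    def capture_recapture(N, m, n, seed=0):
        """Tag m of N fish, recapture n, count tagged, return N' = nm/X."""
        rng = random.Random(seed)
        tagged = set(rng.sample(range(N), m))
        x = sum(1 for f in rng.sample(range(N), n) if f in tagged)
        return float("inf") if x == 0 else n * m / x

    print(capture_recapture(N=1000, m=100, n=100))   # true N is 1000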
If we imagine tagging the second capture with a red tag and releasing them
again, we now have three colours of tagged fish viz. blue, red and blue-and-
red. If we then capture a third time we get three random variables depending on
what colour we count. However, these random variables are drawn from
distributions *different* from that of X, and different from each other, hence
I would advise caution in just combining them.
Also note that while X ~ HG(...), N' is not. On the other hand 1/N' is at least
proportional to a random variable ~ HG(...). This suggests that perhaps the
harmonic mean is the appropriate means (n.p.i.) of combining multiple results -
or perhaps a weighted harmonic mean - but this is just a guess.
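That guess, as code (the individual estimates here are invented):

    def harmonic_mean(estimates):
        """Unweighted harmonic mean of several N' values."""
        return len(estimates) / sum(1.0 / e for e in estimates)

    print(harmonic_mean([950.0, 1100.0, 1250.0]))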
This in no way addresses the concerns that bugs are not all equally likely to
be found. At least with current technology, bugs, unlike fish, do not become
wary and hence less likely to be found the second time. (-:
[*] 1H translates as 1 = first year university, H = half subject. My other half
subject was Computing 1H and the rest, as they say, is history.
|
1990.7 | practicalities...? | MOVIES::ROBLES | Russell Robles, EDO 13 | Thu Sep 14 1995 11:59 | 19 |
| re: .0
> The program is given to 2 independent testing teams,
> which find n(1) and n(2) errors respectively,
> k(1,2) of which are common.
>
> Using the hypergeometric distribution, the MLE (= Maximum
> Likelihood Estimate) for the total number of errors in
> the program, N , can be shown as:
>
>    ^          n(1) * n(2)
>    N = floor  -----------
>                 k(1,2)
This has an interesting edge condition: if team 1 finds a subset of the bugs
found by team 2, we would estimate that team 2 had found all the bugs, no
matter how few bugs team 1 found.
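In the .0 notation: if all of team 1's bugs are common, then k(1,2) = n(1) and
the estimate collapses to n(2). A one-liner with invented numbers:

    n1, n2, k12 = 5, 40, 5     # team 1's five bugs all found by team 2
    print(n1 * n2 // k12)      # -> 40, i.e. "team 2 found everything"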
Nothing to do with the problem, of course...
|
1990.8 | fish and bugs not uniformly distributed | RANGER::BRADLEY | Chuck Bradley | Thu Sep 14 1995 14:05 | 28 |
|
re .6, estimating fish population:

There are lots of practical difficulties with fish as well as with bugs; it
works fine for some species and not so fine for others. Basically the
assumption is that the fish are uniformly distributed, and most fishermen
will argue that is not true.

Try the experiment on a species that swims in schools. If the second
dip hits the same school, you get an estimate of the size of the school.
If it hits a different school, you get the same estimate of the population
no matter how many schools there are.
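A toy illustration (school size and capture sizes invented): when both dips
land in the same school, N' estimates the school, not the population.

    import random

    def same_school_estimate(school_size, m=20, n=20, seed=0):
        """Both dips sample fish from one school of school_size fish."""
        rng = random.Random(seed)
        tagged = set(rng.sample(range(school_size), m))
        x = sum(1 for f in rng.sample(range(school_size), n) if f in tagged)
        return float("inf") if x == 0 else n * m / x

    # N' comes out near 100, however many schools the lake actually holds.
    print(same_school_estimate(school_size=100))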
Bugs are also not uniformly distributed, hence the common wisdom of
rewriting or inspecting the few modules of a software product that have the
most reported bugs.

Even with the limitations, the math is interesting. It would be nice to
compare the estimates to the lower limits provided by subsequent experience.
For example: continue to sample, fix, and compute until the estimate is less
than a suitable limit; then release and count subsequent problems. If a lot
of projects did this and shared results, we might get better at software
development. Of course, some projects or companies will pick the suitable
limit to be the number of bits in the product, or the estimate produced on
the day the product is scheduled to be released.
|
1990.9 | more along the lines of .5 | DECADA::YODER | MFY | Mon Sep 18 1995 13:31 | 33 |
| It might be worthwhile to give a proof of the formula for the mean of X, since I
suspect the technique may be applicable to the harder problem.
Let C(a,b) = 0 if b < 0 or b > a, and C(a,b) = a!/(b!(a-b)!) otherwise; let
sum(x) abbreviate sum(x in Z). We have
P(X=x) = C(m,x)C(N-m,n-x)/C(N,n)
The expected value of x is

        sum(x) x C(m,x) C(N-m,n-x) / C(N,n)

To evaluate this, note

        sum(x) C(m,x) z^x = (1+z)^m
        sum(x) x C(m,x) z^x = mz (1+z)^(m-1)    (differentiate and multiply by z)
        sum(x) C(N-m,x) z^x = (1+z)^(N-m)

The product of the last two series is mz (1+z)^(N-1) = sum(x) m C(N-1,x-1) z^x.
Equating coefficients of z^n gives sum(x) x C(m,x) C(N-m,n-x) = m C(N-1,n-1).
So the expected value is m C(N-1,n-1) / C(N,n) = mn/N.
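A quick numerical check of the identity, with made-up sizes:

    from math import comb

    N, m, n = 30, 12, 8
    lhs = sum(x * comb(m, x) * comb(N - m, n - x) for x in range(n + 1))
    assert lhs == m * comb(N - 1, n - 1)
    print(lhs / comb(N, n), m * n / N)   # both equal mn/N = 3.2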
My first try for the k-team situation would be to note that x is related to t,
the total number of bugs found by at least one team, by the formula x = m+n-t
in the case of 2 teams; I'd try, in the k-team case, to get an expected value
for t and then use that to get an estimator for N.
|
1990.10 | conjectures | DECADA::YODER | MFY | Fri Sep 22 1995 13:36 | 34 |
| I've been very busy, and will remain so for the foreseeable future. Maybe
somebody can prove or disprove these?
Conjecture 1: if t is the total number of bugs found by at least one team,
then the expected value of t/N (given the n(i) and N) is

        1 - prod(1<=i<=k) (1 - n(i)/N)
where k is the number of teams, n(i) the number of bugs found by team i, and N
the total number of bugs.
This conjecture is true for k=0, k=1, and k=2. So, by induction, it could
hardly fail to be true for larger k. :-)
To get an estimate for N given t, try letting

        t/N = 1 - prod(1<=i<=k) (1 - n(i)/N),

i.e.

        t * N^(k-1) = N^k - prod(1<=i<=k) (N - n(i)),

which makes N the root of a polynomial of degree k-1 (if t is not the sum of
the n(i)) and of degree k-2 otherwise. Conjecture: if t is the sum of
the n(i), and k>1, the MLE for N is infinity (speaking loosely).
Conjecture: if t isn't the sum of the n(i), there is only one root r of this
polynomial which is >= the max of the n(i). The MLE is r if r is an integer;
otherwise it is one of the two integers closest to r. For k=2 this is true
according to .0, and moreover the MLE is the smaller integer; I'd be surprised
if that were true generally, but perhaps there is a simple way of expressing
which of the two is the MLE.
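A sketch of solving the degree-(k-1) equation numerically, by bisection on N
(the team counts are invented; uniqueness and monotonicity of the root are
assumed per the conjectures, not proved):

    def expected_t(N, counts):
        """N times the conjectured expected value of t/N."""
        prod = 1.0
        for n in counts:
            prod *= 1.0 - n / N
        return N * (1.0 - prod)

    def estimate_N(counts, t):
        """Root r >= max(counts) of expected_t(N) = t, assuming
        max(counts) < t < sum(counts)."""
        lo, hi = max(counts) + 1e-9, 2.0 * max(counts)
        while expected_t(hi, counts) < t:    # expand until the root is bracketed
            hi *= 2.0
        for _ in range(200):                 # plain bisection
            mid = (lo + hi) / 2.0
            if expected_t(mid, counts) < t:
                lo = mid
            else:
                hi = mid
        return lo

    # Three teams find 30, 40 and 50 bugs; 90 distinct bugs seen in all.
    print(estimate_N([30, 40, 50], 90))      # -> roughly 142.6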
|