T.R | Title | User | Personal Name | Date | Lines |
--------------------------------------------------------------------------------
1990.1 | re. .0: Clarification | EVTSG8::ESANU | Au temps pour moi | Tue Aug 29 1995 13:59 | 13 |
| > Using the hypergeometric distribution, the MLE (= Maximum
> Likelihood Estimate) for the total number of errors in
> the program, N , can be shown as:
>
>    ^          n(1) * n(2)
>    N = floor  -----------
>                 k(1,2)
"Total number of errors in the program" means really total, including
unknown errors, namely errors not discovered by any of the testing teams.
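A minimal sketch of the estimator in Python (the counts in the example are
made up):

    from math import floor

    def mle_total_errors(n1, n2, k12):
        """MLE from .0: floor(n1 * n2 / k12)."""
        if k12 == 0:
            raise ValueError("no common errors: the estimate is unbounded")
        return floor(n1 * n2 / k12)

    # Example: team 1 finds 30 errors, team 2 finds 40, 12 in common.
    print(mle_total_errors(30, 40, 12))   # -> 100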
/Mihai.
|
1990.2 | remind me, please... | JOBURG::BUCHANAN | | Mon Sep 11 1995 18:42 | 7 |
| All my books are in storage at the other end of the accessible
universe. Could you remind me of the hypergeometric distribution, so
that I can check that we are working from the same assumptions about
the problem (e.g. all bug detections equiprobable and independent)?
Cheers,
Andrew.
|
1990.3 | Definition of the hypergeometric distribution | EVTSG8::ESANU | Au temps pour moi | Tue Sep 12 1995 10:53 | 34 |
| In a set A, consider a partition into two subsets A1 and A2 , so that:

        A = A1 ∪ A2
        A1 ∩ A2 = ∅

Let

        card A  = n
        card A1 = n1
        card A2 = n2
A group of r elements is chosen at random. Let q(k) be the probability
that the group so chosen will contain exactly k elements from the set A1
( k <= min {n1,r} ).
q(k) is shown to be
               C(n1, k) * C(n - n1, r - k)
        q(k) = ---------------------------
                         C(n, r)

where C(a, b) denotes the number of combinations of a elements taken b at a
time (the binomial coefficient).
The system of probabilities so defined is called "the hypergeometric
distribution".
The probabilities q(k) are defined only for k <= min {n1, r} , but since
C(a, b) = 0 whenever b > a, the formula for q(k) above gives q(k) = 0
if either k > n1 or k > r . So the formula can be employed for all
k >= 0, provided q(k) = 0 is interpreted as impossibility.
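A direct transcription into Python; math.comb already follows the convention
C(a, b) = 0 whenever b > a:

    from math import comb

    def q(k, n, n1, r):
        """Hypergeometric probability of exactly k elements from A1.
        Valid for 0 <= k <= r (math.comb raises on negative arguments);
        comb(n1, k) is 0 when k > n1, giving q(k) = 0 as above."""
        return comb(n1, k) * comb(n - n1, r - k) / comb(n, r)

    # Example with made-up sizes: n = 50, n1 = 20, r = 10.
    print(q(4, 50, 20, 10))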
You must feel lonely without your books, Andrew! I'd be lost without mine.
Cheers,
Mihai.
|
1990.4 | | WIBBIN::NOYCE | EV5 issues 4 instructions per meter | Tue Sep 12 1995 12:10 | 29 |
| It looks like the formula in .0 says:

        Divide the set of all bugs into two sets, those found by group 1
        and those not found by group 1. Group 2 finds some fraction of
        the bugs found by group 1; assume they find the same fraction of
        the set of all bugs.

This gives an estimate consistent with the hypergeometric distribution.
It's also nice in that it's symmetric -- it doesn't matter which group
is called group 1.
One problem is that it assumes all bugs are equally likely to be found
-- not very typical!
To add a third group, you could simply compute three pairwise estimates
and average them somehow. Or you could lump the bugs found by groups
1 and 2 into a single group, and apply the formula in .0 to that group
and group 3 -- does doing this all three ways come up with different
results? These approaches are all based on the original assumption.
But it seems more practical to use extra groups to challenge the assumption
that all bugs are equally likely. Perhaps throw out any bugs found by
all three groups, and apply .0's formula pairwise to the rest (this shrinks
the intersection and so increases the estimate), and finally add in the
common bugs. If each group finds the same set of bugs, it seems plausible
that there are no more; if each group finds some unique bugs but the only
intersection is the bugs found by all three, it seems plausible that there
are a whole lot more left to be found (the formula reduces to 1/0).
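A sketch of these pairwise ideas in Python (the bug sets and the simple
averaging rule are invented for illustration):

    def pairwise_estimate(found_a, found_b):
        """The .0 formula applied to two teams' bug sets."""
        common = len(found_a & found_b)
        if common == 0:
            return float("inf")      # no overlap: the estimate blows up
        return len(found_a) * len(found_b) // common

    def three_team_average(f1, f2, f3):
        """One arbitrary way to combine the three pairwise estimates."""
        pairs = [(f1, f2), (f1, f3), (f2, f3)]
        return sum(pairwise_estimate(a, b) for a, b in pairs) / 3

    def lumped_estimate(f1, f2, f3):
        """Lump groups 1 and 2 into one team, then apply .0 against group 3."""
        return pairwise_estimate(f1 | f2, f3)

    f1, f2, f3 = {1, 2, 3, 4}, {3, 4, 5, 6}, {1, 4, 6, 7}
    print(three_team_average(f1, f2, f3), lumped_estimate(f1, f2, f3))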
|
1990.5 | probability of bug finding | CSSE::NEILSEN | Wally Neilsen-Steinhardt | Wed Sep 13 1995 13:37 | 24 |
| .4> But it seems more practical to use extra groups to challenge the assumption
> that all bugs are equally likely.
I would agree, but I would say "test the hypothesis".

I thought about this problem years ago. I can formulate it, but the math is
beyond me. Maybe somebody here can work through to an analytic solution, or
maybe it is worth the trouble to simulate it.
Assume that bugs are not all equally likely to be found, that is, the
probability of detecting a bug varies with the bug. Number the bugs in order
of decreasing probability, and assume a simple form of variation:

        P(n,i) = Ai * exp( - Bi * n )    for n = 1 to nmax

where P(n,i) is the probability of finding bug n during test i, and Ai and Bi
are constants depending on test i. The hypothesis that all bugs are equally
likely corresponds to Bi = 0.

The problem is to calculate Ai, Bi and nmax with confidence intervals (or,
equivalently, the posterior distributions of these variables) given the number
of bugs found in the various tests and the number of identical bugs found in
multiple tests.
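A quick simulation sketch of this model (all the constants are invented),
showing how the estimator from .0 behaves when Bi > 0:

    import random
    from math import exp

    def simulate_test(nmax, A, B, seed=None):
        """One test: bug n is found with probability A * exp(-B * n)."""
        rng = random.Random(seed)
        return {n for n in range(1, nmax + 1) if rng.random() < A * exp(-B * n)}

    # Two tests over the same 200 hypothetical bugs, then the .0 estimate.
    t1 = simulate_test(200, 0.6, 0.01, seed=1)
    t2 = simulate_test(200, 0.5, 0.02, seed=2)
    k = len(t1 & t2)
    print(len(t1), len(t2), k, len(t1) * len(t2) // k if k else "inf")

With Bi > 0 the overlap is concentrated in the easy bugs, so the estimate
tends to come out below the true 200.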
|
1990.6 | some musings - no answers | AUSSIE::GARSON | achtentachtig kacheltjes | Wed Sep 13 1995 20:19 | 56 |
| The quintessential example of the hypergeometric distribution is fish
population estimation.
Suppose we wish to estimate the total number of fish in some closed
environment.
Let N be the (unknown) total number of fish. Capture alive a certain number of
fish m, tag them, let's say with a blue tag, and release them. We then capture
a further n fish and count the number of these, X, that are tagged.
P(X=x) = C(m,x)C(N-m,n-x)/C(N,n)
X is a random variable and I will write that X ~ HG(N,m,n).
It can be shown that

        mean[X] = nm/N,  i.e.  mean[X] = np          where p = m/N

and that

        variance[X] = npq (N-n)/(N-1)                where q = 1 - p
If N >> n then the hypergeometric distribution can be approximated by the
binomial distribution. My Statistics 1H[*] notes, which I dredged out last
night, suggest that this approximation is reasonable for N >= 10n.
Unfortunately, in the context of the base note that corresponds to a bug find
rate of not more than 10%, which I wouldn't consider either desirable or
satisfactory. It is of course quite reasonable in sampling fish populations.
In any case we can use the formula for the mean to derive an estimator for N,
normally written as N with a ^ above it but I'll write it here as N'.
N' = nm/X
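A small sketch of the experiment (population and capture sizes made up),
comparing N' with the true N:

    import random

    def capture_recapture(N, m, n, seed=0):
        """Tag m of N fish, recapture n, count tagged, return N' = nm/X."""
        rng = random.Random(seed)
        tagged = set(rng.sample(range(N), m))
        x = sum(1 for f in rng.sample(range(N), n) if f in tagged)
        return float("inf") if x == 0 else n * m / x

    print(capture_recapture(N=1000, m=100, n=100))   # true N is 1000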
If we imagine tagging the second capture with a red tag and releasing them
again, we now have three colours of tagged fish viz. blue, red and blue-and-
red. If we then capture a third time we get three random variables depending on
what colour we count. However, these random variables are drawn from
distributions *different* from that of X, and different from each other, hence
I would advise caution in just combining them.
Also note that while X ~ HG(...), N' is not. On the other hand 1/N' is at least
proportional to a random variable ~ HG(...). This suggests that perhaps the
harmonic mean is the appropriate means (n.p.i.) of combining multiple results -
or perhaps a weighted harmonic mean - but this is just a guess.
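That guess, as code (the individual estimates here are invented):

    def harmonic_mean(estimates):
        """Unweighted harmonic mean of several N' values."""
        return len(estimates) / sum(1.0 / e for e in estimates)

    print(harmonic_mean([950.0, 1100.0, 1250.0]))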
This in no way addresses the concerns that bugs are not all equally likely to
be found. At least with current technology, bugs, unlike fish, do not become
wary and hence less likely to be found the second time. (-:
[*] 1H translates as 1 = first year university, H = half subject. My other half
subject was Computing 1H and the rest, as they say, is history.
|
1990.7 | practicalities...? | MOVIES::ROBLES | Russell Robles, EDO 13 | Thu Sep 14 1995 11:59 | 19 |
| re: .0
> The program is given to 2 independent testing teams,
> which find n(1) and n(2) errors respectively,
> k(1,2) of which are common.
>
> Using the hypergeometric distribution, the MLE (= Maximum
> Likelihood Estimate) for the total number of errors in
> the program, N , can be shown as:
>
>    ^          n(1) * n(2)
>    N = floor  -----------
>                 k(1,2)
This has an interesting edge condition: if team 1 finds a subset of the bugs
found by team 2, we would estimate that team 2 had found all the bugs, no
matter how few bugs team 1 found.
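In the .0 notation: if all of team 1's bugs are common, then k(1,2) = n(1) and
the estimate collapses to n(2). A one-liner with invented numbers:

    n1, n2, k12 = 5, 40, 5     # team 1's five bugs all found by team 2
    print(n1 * n2 // k12)      # -> 40, i.e. "team 2 found everything"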
Nothing to do with the problem, of course...
|
1990.8 | fish and bugs not uniformly distributed | RANGER::BRADLEY | Chuck Bradley | Thu Sep 14 1995 14:05 | 28 |
|
re .6, estimating fish population:

There are lots of practical difficulties with fish as well as with bugs; it
works fine for some species and not so fine for others. Basically the
assumption is that the fish are uniformly distributed, and most fishermen
will argue that is not true.

Try the experiment on a species that swims in schools. If the second
dip hits the same school, you get an estimate of the size of the school.
If it hits a different school, you get the same estimate of the population
no matter how many schools there are.
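A toy illustration (school size and capture sizes invented): when both dips
land in the same school, N' estimates the school, not the population.

    import random

    def same_school_estimate(school_size, m=20, n=20, seed=0):
        """Both dips sample fish from one school of school_size fish."""
        rng = random.Random(seed)
        tagged = set(rng.sample(range(school_size), m))
        x = sum(1 for f in rng.sample(range(school_size), n) if f in tagged)
        return float("inf") if x == 0 else n * m / x

    # N' comes out near 100, however many schools the lake actually holds.
    print(same_school_estimate(school_size=100))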
Bugs are also not uniformly distributed, hence the common wisdom of
rewriting or inspecting the few modules of a software product that have the
most reported bugs.

Even with the limitations, the math is interesting. It would be nice to
compare the estimates to the lower limits provided by subsequent experience.
For example: continue to sample, fix, and compute until the estimate is less
than a suitable limit; then release and count subsequent problems. If a lot
of projects did this and shared results, we might get better at software
development. Of course, some projects or companies will pick the suitable
limit to be the number of bits in the product, or the estimate produced on
the day the product is scheduled to be released.
|
1990.9 | more along the lines of .5 | DECADA::YODER | MFY | Mon Sep 18 1995 13:31 | 33 |
| It might be worthwhile to give a proof of the formula for the mean of X, since I
suspect the technique may be applicable to the harder problem.
Let C(a,b) = 0 if b < 0 or b > a, and C(a,b) = a!/(b!(a-b)!) otherwise; let
sum(x) abbreviate sum(x in Z). We have
P(X=x) = C(m,x)C(N-m,n-x)/C(N,n)
The expected value of x is

        sum(x) x C(m,x) C(N-m,n-x) / C(N,n)

To evaluate this, note

        sum(x) C(m,x) z^x = (1+z)^m
        sum(x) x C(m,x) z^x = mz (1+z)^(m-1)    (differentiate and multiply by z)
        sum(x) C(N-m,x) z^x = (1+z)^(N-m)

The product of the last two series is mz (1+z)^(N-1) = sum(x) m C(N-1,x-1) z^x.
Equating coefficients of z^n gives sum(x) x C(m,x) C(N-m,n-x) = m C(N-1,n-1).
So the expected value is m C(N-1,n-1) / C(N,n) = mn/N.
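A quick numerical check of the identity, with made-up sizes:

    from math import comb

    N, m, n = 30, 12, 8
    lhs = sum(x * comb(m, x) * comb(N - m, n - x) for x in range(n + 1))
    assert lhs == m * comb(N - 1, n - 1)
    print(lhs / comb(N, n), m * n / N)   # both equal mn/N = 3.2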
My first try for the k-team situation would be to note that x is related to t,
the total number of bugs found by at least one team, by the formula x = m+n-t
in the case of 2 teams; I'd try, in the k-team case, to get an expected value
for t and then use that to get an estimator for N.
|
1990.10 | conjectures | DECADA::YODER | MFY | Fri Sep 22 1995 13:36 | 34 |
| I've been very busy, and will remain so for the foreseeable future. Maybe
somebody can prove or disprove these?
Conjecture 1: if t is the total number of bugs found by at least one team,
then the expected value of t/N (given the n(i) and N) is

        1 - prod(1<=i<=k) (1 - n(i)/N)
where k is the number of teams, n(i) the number of bugs found by team i, and N
the total number of bugs.
This conjecture is true for k=0, k=1, and k=2. So, by induction, it could
hardly fail to be true for larger k. :-)
To get an estimate for N given t, try letting

        t/N = 1 - prod(1<=i<=k) (1 - n(i)/N),

i.e.

        t * N^(k-1) = N^k - prod(1<=i<=k) (N - n(i)),

which makes N the root of a polynomial of degree k-1 (if t is not the sum of
the n(i)) and of degree k-2 otherwise. Conjecture: if t is the sum of
the n(i), and k>1, the MLE for N is infinity (speaking loosely).
Conjecture: if t isn't the sum of the n(i), there is only one root r of this
polynomial which is >= the max of the n(i). The MLE is r if r is an integer;
otherwise it is one of the two integers closest to r. For k=2 this is true
according to .0, and moreover the MLE is the smaller integer; I'd be surprised
if that were true generally, but perhaps there is a simple way of expressing
which of the two is the MLE.
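A sketch of solving the degree-(k-1) equation numerically, by bisection on N
(the team counts are invented; uniqueness and monotonicity of the root are
assumed per the conjectures, not proved):

    def expected_t(N, counts):
        """N times the conjectured expected value of t/N."""
        prod = 1.0
        for n in counts:
            prod *= 1.0 - n / N
        return N * (1.0 - prod)

    def estimate_N(counts, t):
        """Root r >= max(counts) of expected_t(N) = t, assuming
        max(counts) < t < sum(counts)."""
        lo, hi = max(counts) + 1e-9, 2.0 * max(counts)
        while expected_t(hi, counts) < t:    # expand until the root is bracketed
            hi *= 2.0
        for _ in range(200):                 # plain bisection
            mid = (lo + hi) / 2.0
            if expected_t(mid, counts) < t:
                lo = mid
            else:
                hi = mid
        return lo

    # Three teams find 30, 40 and 50 bugs; 90 distinct bugs seen in all.
    print(estimate_N([30, 40, 50], 90))      # -> roughly 142.6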
|