
Conference rusure::math

Title:Mathematics at DEC
Moderator:RUSURE::EDP
Created:Mon Feb 03 1986
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:2083
Total number of notes:14613

831.0. "That's a lotta tea bags." by FGVAXZ::SPELLMAN () Thu Feb 25 1988 15:12

    I drink a lot of tea and a while ago started accumulating Salada tea
    bag sayings, one from each bag.  I have 157 so far (with lots of
    duplicates) and have probably drunk over 1000 cups of tea (with the
    help of friends).  I began to wonder if I had them all, after it had
    been a few months since I'd gotten a new one.  Enter ... (drum roll,
    please) the remnants of College Math. 

    I'm assuming that the sayings are uniformly distributed.  I use the
    notation: 

	p(n,f,t)

    for the probability of finding f different sayings out of a population
    of n, having looked at t tea bags.  I came up with the following
    recursive formula: 

	p(n,f,t) = f/n * p(n,f,t-1) + (n-f+1)/n * p(n,f-1,t-1)

    if 1 < f < t.  For boundary conditions (is that the right term?), I
    got: 

	p(n,1,t) = n / (n^t)   and
	p(n,t,t) = n! / ((n-t)! * n^t)
		 = ch(n,t) * t! / (n^t)   where ch(n,t) is the bin. coeff.

    After banging on my VAX, I came up with the non-recursive formula: 

                              f
	p(n,f,t) = ch(n,f) * sum((i/n)^t * ch(f,i) * (-1)^(f-i))
                             i=1
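
    (A cross-check, sketched here in Python for convenience; the
    function names p_rec and p_closed are mine.  It verifies that the
    recursion and boundary conditions above match the closed form for
    small n and t.)

        from functools import lru_cache
        from math import comb, factorial

        @lru_cache(maxsize=None)
        def p_rec(n, f, t):
            # probability of exactly f distinct sayings out of n in t draws
            if f < 1 or f > t or f > n:
                return 0.0
            if f == 1:
                return n / n**t                          # all t draws identical
            if f == t:
                return comb(n, t) * factorial(t) / n**t  # all draws distinct
            return (f/n) * p_rec(n, f, t-1) + ((n-f+1)/n) * p_rec(n, f-1, t-1)

        def p_closed(n, f, t):
            # ch(n,f) * sum over i = 1..f of (i/n)^t * ch(f,i) * (-1)^(f-i)
            return comb(n, f) * sum((i/n)**t * comb(f, i) * (-1)**(f - i)
                                    for i in range(1, f + 1))

        for n in range(1, 7):
            for t in range(1, 9):
                for f in range(1, min(n, t) + 1):
                    assert abs(p_rec(n, f, t) - p_closed(n, f, t)) < 1e-12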
    
    Now, my question is: do I have all of the sayings?  I don't know how to
    ask this mathematically, but I think it has to do with confidence
    intervals and likelihoods.  Can anyone out there shed some light? 

    Thanks
    Chris

    p.s.  I have been through only a piece of the notes file; I'm up to
    about Sept. 1986.  I couldn't find anything like this with the title
    search.  If there is a good pointer, please let me know. 

831.1. "shake out those cobwebs" by ZFC::DERAMO (Your wish is my .com) Thu Feb 25 1988 18:56

    This has a history.  I believe a mathematician was asked by a
    butterfly-collecting friend to figure out how many species there
    probably were, based on the number of different species found among
    the known number of collected insects.  He solved the problem.
    Recently, some literary works were tested [either Shakespeare or
    the Federalist papers] to see if some [newly discovered? disputed
    author?] text had the proper number of "new" words in it.  I.e.,
    given x distinct words out of the author's total output of N, if
    he also wrote this little m-word text, how many of those words
    should be "new" and not among the previously used x words?  They
    used the same method as in the butterfly problem.
    
    Sorry for how vague this is, but that's all I really remember of
    it.  The story may have been from the Boston Globe's "Sci-Tech"
    section of their Monday newspaper, sometime in the last few years.
    
    Dan

831.2. "You've probably got them all..." by CADM::ROTH (If you plant ice you'll harvest wind) Fri Feb 26 1988 08:53

  If there are (hypothetically) N sayings, and you have seen M < N
  of them in T trials, you can easily evaluate the probability that
  in all of those T trials you've never drawn one of the N-M
  unseen sayings; it's just (M/N)^T.

  Thus in 1000 trials we get some sample probabilities (M = 157):

           N-M     M/N         (M/N)**1000
           ---   --------      ------------
            1    0.993671      0.174810E-02
            2    0.987421      0.318074E-05
            3    0.981250      0.602102E-08
            4    0.975155      0.118516E-10
            5    0.969136      0.242459E-13
            6    0.963190      0.515286E-16
            7    0.957317      0.113711E-18
            8    0.951515      0.260440E-21
            9    0.945783      0.618820E-24
           10    0.940120      0.152469E-26
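
  (The table is easy to regenerate; a sketch in Python, with M and T
  as above.  The E-format normalization differs from the VAX output,
  but the values match:)

        M, T = 157, 1000
        for k in range(1, 11):                  # k = N - M unseen sayings
            N = M + k
            print("%5d   %8.6f   %14.6E" % (k, M / N, (M / N) ** T))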

  - Jim

831.3. "Not quite that easy." by FGVAXZ::SPELLMAN () Fri Feb 26 1988 09:51

    re .2:
    
    > If there are (hypothetically) N sayings, and you have seen M < N
    > of them in T trials, you can easily evaluate the probability that
    > in all of those T trials you've never drawn one of the N-M
    > unseen sayings; it's just (M/N)^T.

    If we have 3 sayings and 4 trials, listing the possibilities produces:
    
    found	combinations	probability
      1		     3		    3/81
      2		    42		   42/81
      3		    36		   36/81
    
    Your formula produces (2/3)^4 or 16/81.  So I think there is something
    missing in your logic, although I don't know what.
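
    (The table above can be checked by brute force over all 3^4 = 81
    sequences; a sketch in Python:)

        from itertools import product
        from collections import Counter

        # tally sequences of 4 draws from 3 sayings by distinct sayings found
        tally = Counter(len(set(seq)) for seq in product(range(3), repeat=4))
        print(sorted(tally.items()))       # [(1, 3), (2, 42), (3, 36)]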
    
    Chris

831.4. by CADM::ROTH (If you plant ice you'll harvest wind) Fri Feb 26 1988 10:05

    After entering my note I realized that it was a total babble...
    since for one thing it assumes that you know the set of sayings
    that have been omitted in the hypothesis.

    Rather than thinking further I tried a Monte Carlo test on this.
    What is the likelihood that in 1000 trials at least one of 158 histogram
    bins has not been selected?

    In 774 out of 1000 trials (each making 1000 random selections)
    all bins were hit at least once.
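
    (A sketch of the experiment in Python; 158 bins, 1000 draws per
    trial, 1000 trials, as above:)

        import random

        BINS, DRAWS, TRIALS = 158, 1000, 1000
        hits = sum(len({random.randrange(BINS) for _ in range(DRAWS)}) == BINS
                   for _ in range(TRIALS))
        print(hits, "of", TRIALS, "trials hit every bin")  # near the 774 above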

    Best not to even attempt to do probability before having coffee in the
    morning!

    - Jim

831.5. "Word Frequency Analysis" by NAC::PICKETT (David - Dept. of Redundancy Dept.) Thu Mar 03 1988 16:47

    The statistical analysis of word counts was also applied to the
    letters of Abelard and Heloise.  This was detailed in the Globe's
    Sci-Tech section.  A local authority on medieval literature has
    proposed that this classic romantic dialog was written by one
    person, namely Abelard.  He used word frequency counts to form a
    'signature' of the author's style.  His research led him to believe
    that the author of all the letters was the same person.
    
    More Trivia,
    dp

831.6. "preaching the Bayesian gospel" by PULSAR::WALLY (Wally Neilsen-Steinhardt) Fri Apr 08 1988 15:10

    Bayesian analysis is the tool for this kind of problem.
    
    But first, two related problems:
    
    	Given the set of serial numbers in captured German tanks, what
	is the probability that the Germans currently have N tanks
	deployed?  This problem was actually stated and solved during
	WW2.  It was made easier by the fact that the Germans actually
	used consecutive serial numbers on tanks.
    
    	Given that testing has found n bugs in a program, what is the
	probability that there are m bugs remaining?  This problem is
	often stated, but I have never seen it solved, because the
	probabilities of finding different bugs are neither equal nor
	constant.
    
    
    The problem states that we have drawn n samples from a space of
    objects, assumed equally likely to be drawn, and have obtained m
    unique values.  The question is, how do we interpret this observation
    as evidence for or against the proposition that there are exactly
    m unique values in the sample space?
    
    First, introduce k such that k+m is the true number of unique values 
    in the sample space.  Then the proposition that we have seen all 
    unique values is equivalent to saying k = 0.
    
    The most convenient form of Bayes' law for this problem is the ratio
    form:
    
    	P( A | B ) 	P( B | A )	P( A )
    	----------  =	----------  *	------
    	P( A'| B ) 	P( B | A')	P( A')
    
    Where P( A | B ) is read as the probability of A given B.  The values
    we will give to these propositions are
    
    	A 	k=0
    	A'	k=1
    	B	m unique values are seen in n samples
    
    The equation above can be read as: the ratio of the probability
    that k=0 to the probability that k=1 given the evidence is equal
    to the ratio of the probability that k=0 would give that evidence
    to the probability that k=1 would give that evidence, times the
    ratio of the probability that k=0 before the evidence to the 
    probability that k=1 before the evidence.
    
    The last two probabilities, often called the prior probabilities,
    can be taken to be equal in this case, as usual.  This does assume
    that k is bounded, but after all, we would not expect Salada to
    put an infinite amount of work into thinking up sayings.
    
    So our problem is really deciding the probabilities of the evidence
    given values for k.  Consider the sample space of m+k values divided
    into a subset of m values and a subset of k values.  Then on any
    single draw, the probability of drawing from the m subset is given
    by
    
    	P(1,k,m) = m / ( m + k )
    
    and the probability of drawing only from the m subset in n draws is
    
    	P(n,k,m) = ( m / ( m + k ) )**n
    
    When k=0, this probability is 1 (reasonably enough).  For k = 1
    we can substitute into Bayes' equation above to get 

    	P( A | B ) 	
    	----------  =	( ( m + 1 ) / m )**n
       	P( A'| B ) 	
    
    For the values in this problem, this evaluates to about 572.  So
    we can be fairly confident that there is not one more undiscovered
    saying out there.  We should really compute probabilities for all
    k, but since this is a polynomial of degree n in k, we are not
    surprised to see that higher values of k are very improbable.  For
    k=2, the ratio is about 300,000.
    
    All this assumes that all sayings are equally probable.  If sayings
    are introduced in batches, that will not be true.    	
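
    (The two ratios quoted above are a couple of lines of Python:)

        m, n = 157, 1000              # m unique sayings seen, n bags drawn
        for k in (1, 2):
            # ratio P(k=0 | evidence) / P(k | evidence), with equal priors
            print(k, ((m + k) / m) ** n)  # about 572 for k=1, 300,000 for k=2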

831.7. "back to the scratch paper" by PULSAR::WALLY (Wally Neilsen-Steinhardt) Fri Apr 08 1988 15:55

    Having inserted .6, I looked back at earlier notes (understanding
    the problem may inhibit creativity ;-).
    
    Yes, .3 has a point.  We cannot assume that a *given* one of the
    158 is excluded.  If in a test we see 157 out of 158, then any one
    of the 158 could have been excluded.
    
    So .6 does not really solve the problem.  What I need is a better
    calculation of P(A|B) or
    
    	Given m + k unique values in a sample space, what is the
	probability of seeing exactly m in n samples, for given k?
    
    I'll have to think about that a while.

831.8. by CADM::ROTH (If you plant ice you'll harvest wind) Fri Apr 08 1988 18:23

    I made a calculation using the inclusion-exclusion principle and
    came up with a value quite close to the Monte Carlo value mentioned
    earlier.  The analysis neglected the case that not just one but
    two or more sayings were omitted, and I didn't post it because the
    inclusion-exclusion principle didn't correctly apply to those
    remaining cases...  but it was on the right track.

    - Jim

831.9. "combining .0 and .6 gives answer" by PULSAR::WALLY (Wally Neilsen-Steinhardt) Mon Apr 11 1988 13:12

    And it turns out that the formula I wanted was in .0 all the time.
    
    Using the notation of .0, Bayes' Theorem can be applied to give:
    
    	p( n=f )		p( f, f, t )
    	--------	  =     ------------
    	p( n=f+1 )		p( f+1, f, t )
    
    Notice that the leading term in the sum contains (f/n)^t, which
    is just what keeps popping up as a bad guess in .2 and .6.  And
    with t >> f, this should give a similar result.
    
    Bayes' theorem can also be written to give a probability, as opposed
    to a ratio, and then it looks something like
    
    	P( A | B ) = P( B | A ) / ( sum of P( B | A ) for all As )
    
    and the denominator is really just a normalizing factor which ensures
    that the probabilities sum to 1 (taking the priors to be equal, as
    in .6).
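
    (The ratio above can be evaluated exactly; a sketch in Python, done
    with integer arithmetic because ch(n,f) is astronomically large
    while (i/n)^t underflows a float.  f = 157 and t = 1000 are the
    numbers from .0.)

        from math import comb

        def count(n, f, t):
            # number of length-t sequences over n values using exactly f
            # of them, i.e. n^t times the closed-form p(n,f,t) of .0
            return comb(n, f) * sum((-1)**(f - i) * comb(f, i) * i**t
                                    for i in range(1, f + 1))

        f, t = 157, 1000
        # p(f,f,t) / p(f+1,f,t) = count(f,f,t)*(f+1)^t / (count(f+1,f,t)*f^t)
        num = count(f, f, t) * (f + 1) ** t
        den = count(f + 1, f, t) * f ** t
        print(num / den)     # big-integer division; the result fits a float

    (As it happens, the sums for n = f and n = f+1 are identical up to
    the factor ch(f+1,f) = f+1, so the ratio reduces to
    ((f+1)/f)^t / (f+1), i.e. the 572 of .6 divided by 158, about 3.6.)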