
Conference rusure::math

Title:Mathematics at DEC
Moderator:RUSURE::EDP
Created:Mon Feb 03 1986
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:2083
Total number of notes:14613

831.0. "That's a lotta tea bags." by FGVAXZ::SPELLMAN () Thu Feb 25 1988 15:12

    I drink a lot of tea and a while ago started accumulating Salada tea
    bag sayings, one from each bag.  I have 157 so far (with lots of
    duplicates) and have probably drunk over 1000 cups of tea (with the
    help of friends).  I began to wonder if I had them all, after it had
    been a few months since I'd gotten a new one.  Enter ... (drum roll,
    please) the remnants of College Math. 

    I'm assuming that the sayings are uniformly distributed.  I use the
    notation: 

	p(n,f,t)

    for the probability of finding f different sayings out of a population
    of n, having looked at t tea bags.  I came up with the following
    recursive formula: 

	p(n,f,t) = f/n * p(n,f,t-1) + (n-f+1)/n * p(n,f-1,t-1)

    if 1 < f < t.  For boundary conditions (is that the right term?), I
    got: 

	p(n,1,t) = n / (n^t)   and
	p(n,t,t) = n! / ((n-t)! * n^t)
		 = ch(n,t) * t! / (n^t)   where ch(n,t) is the bin. coeff.

    After banging on my VAX, I came up with the non-recursive formula: 

                              f
	p(n,f,t) = ch(n,f) * sum((i/n)^t * ch(f,i) * (-1)^(f-i))
                             i=1
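
    (A cross-check, sketched here in Python for convenience; the
    function names p_rec and p_closed are mine.  It verifies that the
    recursion and boundary conditions above match the closed form for
    small n and t.)

        from functools import lru_cache
        from math import comb, factorial

        @lru_cache(maxsize=None)
        def p_rec(n, f, t):
            # probability of exactly f distinct sayings out of n in t draws
            if f < 1 or f > t or f > n:
                return 0.0
            if f == 1:
                return n / n**t                          # all t draws identical
            if f == t:
                return comb(n, t) * factorial(t) / n**t  # all draws distinct
            return (f/n) * p_rec(n, f, t-1) + ((n-f+1)/n) * p_rec(n, f-1, t-1)

        def p_closed(n, f, t):
            # ch(n,f) * sum over i = 1..f of (i/n)^t * ch(f,i) * (-1)^(f-i)
            return comb(n, f) * sum((i/n)**t * comb(f, i) * (-1)**(f - i)
                                    for i in range(1, f + 1))

        for n in range(1, 7):
            for t in range(1, 9):
                for f in range(1, min(n, t) + 1):
                    assert abs(p_rec(n, f, t) - p_closed(n, f, t)) < 1e-12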
    
    Now, my question is: do I have all of the sayings?  I don't know how to
    ask this mathematically, but I think it has to do with confidence
    intervals and likelihoods.  Can anyone out there shed some light? 

    Thanks
    Chris

    p.s.  I have been through only a piece of the notes file; I'm up to
    about Sept. 1986.  I couldn't find anything like this with the title
    search.  If there is a good pointer, please let me know. 

831.1. "shake out those cobwebs" by ZFC::DERAMO (Your wish is my .com) Thu Feb 25 1988 18:56

    This has a history.  I believe a mathematician was asked by a
    butterfly-collecting friend to figure out how many species there
    probably were, based on the number of different species found among
    the known number of collected insects.  He solved the problem.
    Recently, some literary works were tested [either Shakespeare or
    the Federalist papers] to see if some [newly discovered? disputed
    author?] text had the proper number of "new" words in it.  I.e.,
    given x distinct words out of the author's total output of N, if
    he also wrote this little m-word text, how many of those words
    should be "new" and not among the previously used x words?  They
    used the same method as in the butterfly problem.
    
    Sorry for how vague this is, but that's all I really remember of
    it.  The story may have been from the Boston Globe's "Sci-Tech"
    section of their Monday newspaper, sometime in the last few years.
    
    Dan

831.2. "You've probably got them all..." by CADM::ROTH (If you plant ice you'll harvest wind) Fri Feb 26 1988 08:53

  If there are (hypothetically) N sayings, and you have seen M < N
  of them in T trials, you can easily evaluate the probability that
  in all of those T trials you've never drawn one of the N-M
  unseen sayings; it's just (M/N)^T.

  Thus in 1000 trials we get some sample probabilities (M = 157):

           N-M     M/N         (M/N)**1000
           ---   --------      ------------
            1    0.993671      0.174810E-02
            2    0.987421      0.318074E-05
            3    0.981250      0.602102E-08
            4    0.975155      0.118516E-10
            5    0.969136      0.242459E-13
            6    0.963190      0.515286E-16
            7    0.957317      0.113711E-18
            8    0.951515      0.260440E-21
            9    0.945783      0.618820E-24
           10    0.940120      0.152469E-26
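
  (The table is easy to regenerate; a sketch in Python, with M and T
  as above.  The E-format normalization differs from the VAX output,
  but the values match:)

        M, T = 157, 1000
        for k in range(1, 11):                  # k = N - M unseen sayings
            N = M + k
            print("%5d   %8.6f   %14.6E" % (k, M / N, (M / N) ** T))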

  - Jim

831.3. "Not quite that easy." by FGVAXZ::SPELLMAN () Fri Feb 26 1988 09:51

    re .2:
    
    > If there are (hypothetically) N sayings, and you have seen M < N
    > of them in T trials, you can easily evaluate the probability that
    > in all of those T trials you've never drawn one of the N-M
    > unseen sayings; it's just (M/N)^T.

    If we have 3 sayings and 4 trials, listing the possibilities produces:
    
    found	combinations	probability
      1		     3		    3/81
      2		    42		   42/81
      3		    36		   36/81
    
    Your formula produces (2/3)^4 or 16/81.  So I think there is something
    missing in your logic, although I don't know what.
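
    (The table above can be checked by brute force over all 3^4 = 81
    sequences; a sketch in Python:)

        from itertools import product
        from collections import Counter

        # tally sequences of 4 draws from 3 sayings by distinct sayings found
        tally = Counter(len(set(seq)) for seq in product(range(3), repeat=4))
        print(sorted(tally.items()))       # [(1, 3), (2, 42), (3, 36)]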
    
    Chris

831.4. by CADM::ROTH (If you plant ice you'll harvest wind) Fri Feb 26 1988 10:05

    After entering my note I realized that it was a total babble...
    since for one thing it assumes that you know the set of sayings
    that have been omitted in the hypothesis.

    Rather than thinking further I tried a Monte Carlo test on this.
    What is the likelihood that in 1000 trials at least one of 158 histogram
    bins has not been selected?

    In 774 out of 1000 trials (each making 1000 random selections)
    all bins were hit at least once.
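
    (A sketch of the experiment in Python; 158 bins, 1000 draws per
    trial, 1000 trials, as above:)

        import random

        BINS, DRAWS, TRIALS = 158, 1000, 1000
        hits = sum(len({random.randrange(BINS) for _ in range(DRAWS)}) == BINS
                   for _ in range(TRIALS))
        print(hits, "of", TRIALS, "trials hit every bin")  # near the 774 above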

    Best not to even attempt to do probability before having coffee in the
    morning!

    - Jim

831.5. "Word Frequency Analysis" by NAC::PICKETT (David - Dept. of Redundancy Dept.) Thu Mar 03 1988 16:47

    The statistical analysis of word counts was also applied to the
    letters of Abelard and Heloise.  This was detailed in the Globe's
    Sci-Tech section.  A local authority on medieval literature has
    proposed that this classic romantic dialog was written by one
    person, namely Abelard.  He used word frequency counts to form a
    'signature' of the author's style.  His research led him to believe
    that the author of all the letters was the same person.
    
    More Trivia,
    dp

831.6. "preaching the Bayesian gospel" by PULSAR::WALLY (Wally Neilsen-Steinhardt) Fri Apr 08 1988 15:10

    Bayesian analysis is the tool for this kind of problem.
    
    But first, two related problems:
    
    	Given the set of serial numbers in captured German tanks, what
	is the probability that the Germans currently have N tanks
	deployed?  This problem was actually stated and solved during
	WW2.  It was made easier by the fact that the Germans actually
	used consecutive serial numbers on tanks.
    
    	Given that testing has found n bugs in a program, what is the
	probability that there are m bugs remaining?  This problem is
	often stated, but I have never seen it solved, because the
	probabilities of finding different bugs are neither equal nor
	constant.
    
    
    The problem states that we have drawn n samples from a space of
    objects, assumed equally likely to be drawn, and have obtained m
    unique values.  The question is, how do we interpret this observation
    as evidence for or against the proposition that there are exactly
    m unique values in the sample space?
    
    First, introduce k such that k+m is the true number of unique values 
    in the sample space.  Then the proposition that we have seen all 
    unique values is equivalent to saying k = 0.
    
    The most convenient form of Bayes' law for this problem is the ratio
    form:
    
    	P( A | B ) 	P( B | A )	P( A )
    	----------  =	----------  *	------
    	P( A'| B ) 	P( B | A')	P( A')
    
    Where P( A | B ) is read as the probability of A given B.  The values
    we will give to these propositions are
    
    	A 	k=0
    	A'	k=1
    	B	m unique values are seen in n samples
    
    The equation above can be read as: the ratio of the probability
    that k=0 to the probability that k=1 given the evidence is equal
    to the ratio of the probability that k=0 would give that evidence
    to the probability that k=1 would give that evidence, times the
    ratio of the probability that k=0 before the evidence to the 
    probability that k=1 before the evidence.
    
    The last two probabilities, often called the prior probabilities,
    can be taken to be equal in this case, as usual.  This does assume
    that k is bounded, but after all, we would not expect Salada to
    put an infinite amount of work into thinking up sayings.
    
    So our problem is really deciding the probabilities of the evidence
    given values for k.  Consider the sample space of m+k values divided
    into a subset of m values and a subset of k values.  Then on any
    single draw, the probability of drawing from the m subset is given
    by
    
    	P(1,k,m) = m / ( m + k )
    
    and the probability of drawing only from the m subset in n draws is
    
    	P(n,k,m) = ( m / ( m + k ) )**n
    
    When k=0, this probability is 1 (reasonably enough).  For k = 1
    we can substitute into Bayes' equation above to get 

    	P( A | B ) 	
    	----------  =	( ( m + 1 ) / m )**n
       	P( A'| B ) 	
    
    For the values in this problem, this evaluates to about 572.  So
    we can be fairly confident that there is not one more undiscovered
    saying out there.  We should really compute probabilities for all
    k, but since this is a polynomial of degree n in k, we are not
    surprised to see that higher values of k are very improbable.  For
    k=2, the ratio is about 300,000.
    
    All this assumes that all sayings are equally probable.  If sayings
    are introduced in batches, that will not be true.    	
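
    (The two ratios quoted above are a couple of lines of Python:)

        m, n = 157, 1000              # m unique sayings seen, n bags drawn
        for k in (1, 2):
            # ratio P(k=0 | evidence) / P(k | evidence), with equal priors
            print(k, ((m + k) / m) ** n)  # about 572 for k=1, 300,000 for k=2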

831.7. "back to the scratch paper" by PULSAR::WALLY (Wally Neilsen-Steinhardt) Fri Apr 08 1988 15:55

    Having inserted .6, I looked back at earlier notes (understanding
    the problem may inhibit creativity ;-).
    
    Yes, .3 has a point.  We cannot assume that a *given* one of the
    158 is excluded.  If in a test we see 157 out of 158, then any one
    of the 158 could have been excluded.
    
    So .6 does not really solve the problem.  What I need is a better
    calculation of P(A|B) or
    
    	Given m + k unique values in a sample space, what is the
	probability of seeing exactly m in n samples, for given k?
    
    I'll have to think about that a while.

831.8. by CADM::ROTH (If you plant ice you'll harvest wind) Fri Apr 08 1988 18:23

    I made a calculation using the inclusion-exclusion principle and
    came up with a value quite close to the Monte Carlo value mentioned
    earlier.  The analysis neglected the case that not just one but
    two or more sayings were omitted, and I didn't post it because the
    inclusion-exclusion principle didn't correctly apply to those
    remaining cases...  but it was on the right track.

    - Jim

831.9. "combining .0 and .6 gives answer" by PULSAR::WALLY (Wally Neilsen-Steinhardt) Mon Apr 11 1988 13:12

    And it turns out that the formula I wanted was in .0 all the time.
    
    Using the notation of .0, Bayes' Theorem can be applied to give:
    
    	p( n=f )		p( f, f, t )
    	--------	  =     ------------
    	p( n=f+1 )		p( f+1, f, t )
    
    Notice that the leading term in the sum contains (f/n)^t, which
    is just what keeps popping up as a bad guess in .2 and .6.  And
    with t >> f, this should give a similar result.
    
    Bayes' theorem can also be written to give a probability, as opposed
    to a ratio, and then it looks something like
    
    	P( A | B ) = P( B | A ) / ( sum of P( B | A ) for all As )
    
    and the denominator is really just a normalizing factor which ensures
    that the probabilities sum to 1 (taking the priors to be equal, as
    in .6).
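
    (The ratio above can be evaluated exactly; a sketch in Python, done
    with integer arithmetic because ch(n,f) is astronomically large
    while (i/n)^t underflows a float.  f = 157 and t = 1000 are the
    numbers from .0.)

        from math import comb

        def count(n, f, t):
            # number of length-t sequences over n values using exactly f
            # of them, i.e. n^t times the closed-form p(n,f,t) of .0
            return comb(n, f) * sum((-1)**(f - i) * comb(f, i) * i**t
                                    for i in range(1, f + 1))

        f, t = 157, 1000
        # p(f,f,t) / p(f+1,f,t) = count(f,f,t)*(f+1)^t / (count(f+1,f,t)*f^t)
        num = count(f, f, t) * (f + 1) ** t
        den = count(f + 1, f, t) * f ** t
        print(num / den)     # big-integer division; the result fits a float

    (As it happens, the sums for n = f and n = f+1 are identical up to
    the factor ch(f+1,f) = f+1, so the ratio reduces to
    ((f+1)/f)^t / (f+1), i.e. the 572 of .6 divided by 158, about 3.6.)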