T.R | Title | User | Personal Name | Date | Lines |
----|-------|------|---------------|------|-------|
831.1 | shake out those cobwebs | ZFC::DERAMO | Your wish is my .com | Thu Feb 25 1988 18:56 | 17 |
| This has a history. I believe a mathematician was asked by a
butterfly-collecting friend to figure out how many species there
probably were, based on the number of different species found among
the known number of collected insects. He solved the problem.
Recently, some literary works were tested [either Shakespeare or
the Federalist papers] to see if some [newly discovered? disputed
author?] text had the proper number of "new" words in it. I.e.,
given x distinct words out of the author's total output of N, if
he also wrote this little m-word text, how many of its words should
be "new" and not among the previously used x words? They used the
same method as in the butterfly problem.
Sorry for how vague this is, but that's all I really remember of
it. The story may have been from the Boston Globe's "Sci-Tech"
section of their Monday newspaper, sometime in the last few years.
Dan
|
831.2 | You've probably got them all... | CADM::ROTH | If you plant ice you'll harvest wind | Fri Feb 26 1988 08:53 | 21 |
| If there are (hypothetically) N sayings, and you have seen M < N
of them in T trials, you can easily evaluate the probability that
in all of those T trials you've never drawn one of the N-M
unseen sayings; it's just (M/N)^T.
Thus in 1000 trials we get some sample probabilities (M = 157):
     N-M      M/N      (M/N)**1000
     ---   --------   ------------
      1    0.993671   0.174810E-02
      2    0.987421   0.318074E-05
      3    0.981250   0.602102E-08
      4    0.975155   0.118516E-10
      5    0.969136   0.242459E-13
      6    0.963190   0.515286E-16
      7    0.957317   0.113711E-18
      8    0.951515   0.260440E-21
      9    0.945783   0.618820E-24
     10    0.940120   0.152469E-26
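In Python, for concreteness (a throwaway sketch; the language is my
choice, nothing below depends on it), the table is:
    # P(all T draws land inside a fixed set of M sayings out of N) = (M/N)**T
    M, T = 157, 1000
    print(" N-M      M/N      (M/N)**1000")
    for unseen in range(1, 11):            # unseen = N - M
        N = M + unseen
        print(f"{unseen:4d}   {M / N:.6f}   {(M / N) ** T:.6E}")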
- Jim
|
831.3 | Not quite that easy. | FGVAXZ::SPELLMAN | | Fri Feb 26 1988 09:51 | 19 |
|
re .2:
> If there are (hypothetically) N sayings, and you have seen M < N
> of them in T trials, you can easily evaluate the probability that
> in all of those T trials you've never drawn one of the N-M
> unseen sayings; it's just (M/N)^T.
If we have 3 sayings and 4 trials, listing the possibilities produces:
     found   combinations   probability
       1            3           3/81
       2           42          42/81
       3           36          36/81
Your formula produces (2/3)^4 or 16/81. So I think there is something
missing in your logic, although I don't know what.
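The listing is small enough to do by machine; a brute-force check in
Python (illustrative only) gives the same counts:
    from itertools import product

    # All 3**4 = 81 equally likely outcomes of 4 draws from 3 sayings,
    # tallied by how many distinct sayings each outcome contains.
    tally = {1: 0, 2: 0, 3: 0}
    for outcome in product(range(3), repeat=4):
        tally[len(set(outcome))] += 1
    print(tally)    # {1: 3, 2: 42, 3: 36}
    # P(at least one saying unseen) = (3 + 42)/81 = 45/81, not 16/81.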
Chris
|
831.4 | | CADM::ROTH | If you plant ice you'll harvest wind | Fri Feb 26 1988 10:05 | 15 |
| After entering my note I realized that it was total babble...
since for one thing it assumes that you know the set of sayings
that have been omitted in the hypothesis.
Rather than thinking further I tried a Monte Carlo test on this.
What is the likelihood that in 1000 trials at least one of 158 histogram
bins has not been selected?
In 774 out of 1000 trials (each making 1000 random selections)
all bins were hit at least once.
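The test is a few lines in Python (my reconstruction; the original code
isn't shown):
    import random

    # Of 1000 runs, count those in which every one of 158 equally
    # likely bins is hit at least once in 1000 random selections.
    runs, draws, bins = 1000, 1000, 158
    full = sum(len({random.randrange(bins) for _ in range(draws)}) == bins
               for _ in range(runs))
    print(full, "out of", runs)    # 774 out of 1000 in the run above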
Best not to even attempt to do probability before having coffee in the
morning!
- Jim
|
831.5 | Word Frequency Analysis | NAC::PICKETT | David - Dept. of Redundancy Dept. | Thu Mar 03 1988 16:47 | 10 |
| The statistical analysis of word counts was also applied to the
letters of Abelard and Heloise. This was detailed in the Globe's
Sci-Tech section. A local authority on medieval literature has proposed
that this classic romantic dialog was written by one person, namely
Abelard. He used word frequency counts to form a 'signature' of
the author's style. His research led him to believe that the author
of all the letters was the same person.
More Trivia,
dp
|
831.6 | preaching the Bayesian gospel | PULSAR::WALLY | Wally Neilsen-Steinhardt | Fri Apr 08 1988 15:10 | 80 |
| Bayesian analysis is the tool for this kind of problem.
But first, two related problems:
Given the set of serial numbers in captured German tanks, what
is the probability that the Germans currently have N tanks
deployed? This problem was actually stated and solved during
WW2. It was made easier by the fact that the Germans actually
used consecutive serial numbers on tanks.
Given that testing has found n bugs in a program, what is the
probability that there are m bugs remaining? This problem is
often stated, but I have never seen it solved, because the
probability of finding different bugs is not equal or constant.
The problem states that we have drawn n samples from a space of
objects, assumed equally likely to be drawn, and have obtained m
unique values. The question is, how do we interpret this observation
as evidence for or against the proposition that there are exactly
m unique values in the sample space?
First, introduce k such that k+m is the true number of unique values
in the sample space. Then the proposition that we have seen all
unique values is equivalent to saying k = 0.
The most convenient form of Bayes' law for this problem is the ratio
form:
     P( A | B )   P( B | A )   P( A )
     ---------- = ---------- * ------
     P( A'| B )   P( B | A')   P( A')
Where P( A | B ) is read as the probability of A given B. The values
we will give to these propositions are
     A    k=0
     A'   k=1
     B    m unique values are seen in n samples
The equation above can be read as: the ratio of the probability
that k=0 to the probability that k=1 given the evidence is equal
to the ratio of the probability that k=0 would give that evidence
to the probability that k=1 would give that evidence, times the
ratio of the probability that k=0 before the evidence to the
probability that k=1 before the evidence.
The last two probabilities, often called the prior probabilities,
can be taken to be equal in this case, as usual. This does assume
that k is bounded, but after all, we would not expect Salada to
put an infinite amount of work into thinking up sayings.
So our problem is really deciding the probabilities of the evidence
given values for k. Consider the sample space of m+k values divided
into a subset of m values and a subset of k values. Then on any
single draw, the probability of drawing from the m subset is given
by
P(1,k,m) = m / ( m + k )
and the probability of drawing only from the m subset in n draws is
P(n,k,m) = ( m / ( m + k ) )**n
When k=0, this probability is 1 (reasonably enough). For k = 1
we can substitute into Bayes' equation above to get
     P( A | B )
     ---------- = ( ( m + 1 ) / m )**n
     P( A'| B )
For the values in this problem (m = 157, n = 1000), this evaluates to about 572. So
we can be fairly confident that there is not one more undiscovered
saying out there. We should really compute probabilities for all
k, but since this is a polynomial of degree n in k, we are not
surprised to see that higher values of k are very improbable. For
k=2, the ratio is about 300,000.
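Numerically, in Python (just evaluating the ratio above):
    # Ratio P(k=0 | B) / P(k=k0 | B) = ((m + k0)/m)**n with equal priors
    m, n = 157, 1000
    for k0 in (1, 2):
        print(k0, ((m + k0) / m) ** n)
    # k0=1 -> about 572;  k0=2 -> about 3.1e5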
All this assumes that all sayings are equally probable. If sayings
are introduced in batches, that will not be true.
|
831.7 | back to the scratch paper | PULSAR::WALLY | Wally Neilsen-Steinhardt | Fri Apr 08 1988 15:55 | 14 |
| Having inserted .6, I looked back at earlier notes (understanding
the problem may inhibit creativity ;-).
Yes, .3 has a point. We cannot assume that a *given* one of the
158 is excluded. If in a test we see 157 out of 158, then any one
of the 158 could have been excluded.
So .6 does not really solve the problem. What I need is a better
calculation of P(A|B) or
Given m + k unique values in a sample space, what is the
probability of seeing exactly m in n samples, for given k?
I'll have to think about that a while.
|
831.8 | | CADM::ROTH | If you plant ice you'll harvest wind | Fri Apr 08 1988 18:23 | 8 |
| I made a calculation using the inclusion-exclusion principle and
came up with a value quite close to the Monte Carlo value mentioned
earlier. The analysis neglected the case that not just one but
two or more sayings were omitted, and I didn't post it because the
inclusion-exclusion principle didn't correctly apply to those
remaining cases... but it was on the right track.
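For the record, the complete inclusion-exclusion sum, multiple omissions
and all, is easy to evaluate in Python (a sketch of the full version):
    from math import comb

    # P(at least one of N bins is empty after t draws)
    #   = sum over j >= 1 of (-1)**(j+1) * C(N, j) * (1 - j/N)**t
    N, t = 158, 1000
    p_some_empty = sum((-1) ** (j + 1) * comb(N, j) * (1 - j / N) ** t
                       for j in range(1, N + 1))
    print(1 - p_some_empty)    # about 0.757, vs. 774/1000 by Monte Carlo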
- Jim
|
831.9 | combining .0 and .6 gives answer | PULSAR::WALLY | Wally Neilsen-Steinhardt | Mon Apr 11 1988 13:12 | 20 |
| And it turns out that the formula I wanted was in .0 all the time.
Using the notation of .0, Bayes' Theorem can be applied to give:
      p( n=f )      p( f, f, t )
     ----------- = ---------------
      p( n=f+1 )    p( f+1, f, t )
Notice that the leading term in the sum contains (f/n)^t, which
is just what keeps popping up as a bad guess in .2 and .6. And
with t >> f, this should give a similar result.
Bayes' theorem can also be written to give a probability, as opposed
to a ratio, and then it looks something like
P( A | B ) = P( B | A ) / ( sum of P( B | A ) for all As )
and the denominator is really just a normalizing factor which ensures
that the probabilities sum to 1.
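Concretely, counting by inclusion-exclusion the sequences that hit all
f values of a chosen f-subset (a Python sketch; this is my guess at the
formula in .0, whose leading term is C(n,f)*(f/n)**t as noted above):
    from math import comb

    # p(n, f, t): probability of seeing exactly f distinct sayings in
    # t draws when there are n in all, each equally likely.
    def p(n, f, t):
        surj = sum((-1) ** j * comb(f, j) * (f - j) ** t
                   for j in range(f + 1))
        return comb(n, f) * surj / n ** t

    f, t = 157, 1000
    print(p(f, f, t) / p(f + 1, f, t))    # about 3.6
So the evidence favors n = 157 over n = 158, but only by a factor of
about 3.6, not the 572 of .6.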
|