T.R | Title | User | Personal Name | Date | Lines |
---|
1291.1 | Mass testing. | CADSYS::COOPER | Topher Cooper | Tue Sep 04 1990 18:17 | 26 |
| Here is another, rather topical, "paradox" of conditional probability
(paradox in the sense that its results are contrary to "common sense"),
that's been making the rounds in various forms. The numbers used here
are my own -- they produce rather "tidy" calculations. I'm casting
this in the form of employee drug testing, but it is applicable to
many forms of "group testing". An obvious example is mass testing for
HIV virus antibodies, but there are other examples, not all of them
medical.
--------------------------------------------------------------------------
A particular company has decided to institute a program of testing for
drug use among its employees. In the general population from which the
employees are randomly selected 1% are drug users. The test has a 1%
failure rate both in the sense that out of all non-drug users tested 1%
will get (false) positives, and in the sense that out of all drug users
tested 1% will get (false) negatives. A randomly chosen employee is
tested and is found to test positive, with unpleasant consequences to
the employee.
The question is: what is the probability that the employee is, in fact,
a drug user? Try first to guess the answer without calculation --
preferably without taking into account my presumed motivations for
posting this. The calculated answer will be in the next reply.
Topher
|
1291.2 | Mass testing answer. | CADSYS::COOPER | Topher Cooper | Tue Sep 04 1990 18:29 | 22 |
| Probability of an incorrect accusation = 1/2.
Imagine that there are 10000 employees tested.
By hypothesis, the expected number of drug users is 100.
Among actual drug users we expect 1%, or 1 drug user, to get a false
negative.
Therefore we expect 100-1 = 99 drug users to get a (true)
positive.
We would expect that there will be about 10000 - 100 or 9900
non-drug-users.
Among them we expect 1%, or 99 non-drug-users to get a false
positive.
We expect that there will be the same number (99) of false positives
as true positives overall, and thus that any particular positive is
as likely to be false as true.
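The head count above is Bayes' rule in disguise. The following is a small illustrative sketch in Python, not part of the original calculation; the function name and its parameters are invented for the example, and exact fractions are used so that no rounding enters.

```python
from fractions import Fraction

def p_user_given_positive(prevalence, false_positive_rate, false_negative_rate):
    """P(drug user | positive test) by Bayes' rule."""
    # Users who test positive, as a share of everyone tested.
    true_positives = prevalence * (1 - false_negative_rate)
    # Non-users who test positive, as a share of everyone tested.
    false_positives = (1 - prevalence) * false_positive_rate
    return true_positives / (true_positives + false_positives)

one_percent = Fraction(1, 100)
print(p_user_given_positive(one_percent, one_percent, one_percent))  # 1/2
```

With the same 1% error rates but a 10% base rate of drug use, the same function gives 11/12, which shows how strongly the answer depends on the base rate.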
Think about it.
Topher
|
1291.3 | But is this the problem? | CIVAGE::LYNN | Lynn Yarbrough @WNP DTN 427-5663 | Wed Sep 05 1990 11:44 | 5 |
| While your point is well taken - Paulos discusses it as well - you have
missed the point of the base note. The author screwed up, either
intentionally or not (in the following section of the chapter he discusses
the very trap that he fell into). It's worth thinking through exactly where
he went off the rails, so I will try to refrain from saying more....
|
1291.4 | | GUESS::DERAMO | Dan D'Eramo | Wed Sep 05 1990 12:52 | 14 |
| >> .0 "Consider now some randomly selected family of four. Given that Myrtle has
>> a sibling, what is the conditional probability that her sibling is a
>> brother? Given that Myrtle has a *younger* sibling, what is the conditional
>> probability that her sibling is a brother?"
>> .3 The author screwed up
Aha! The author wrongly assumes that the randomly selected
family of four is made up of a father, a mother, and two
children, and that Myrtle is one of the two children and is
a girl. It could be that Myrtle is the mom, and she happens
to have a brother, who is not a member of the family of four.
Dan
|
1291.5 | ...but did I? | CADSYS::COOPER | Topher Cooper | Wed Sep 05 1990 14:31 | 41 |
| RE: .3 (Dan)
Whether this could be considered an error or not is a matter of
philosophy. Given any finite description of a problem we must make
certain assumptions in order to solve that problem. If we don't
make assumptions then: why assume that Myrtle is female rather than
a male with a weird name? Why assume that Myrtle is a member of the
family at all? Why assume that the probability of a girl vs a boy
is reasonably approximated by 1/2 -- maybe this takes place in a small
rural Chinese village where girl-babies are frequently killed at birth?
Why assume that the informant is telling the truth -- maybe it's really
a family of 12? Maybe the informant is speaking in code?
Word problems differ from first hand real world problems in one
important way. In a first hand real-world problem we would deal with
these issues probabilistically, and, if we feel that alternatives have
slightly too high a probability then we could conditionalize our answer
by our assumptions. In a word problem there is a convention that
routine assumptions which are required for the solution of the problem
are correct unless contradicted -- directly or subtly (in which case it
is a "trick" question) -- by information provided. This is doubly true
in problems provided for expository purpose rather than for puzzles.
It would be legitimate for an author to announce for later expository
purposes that they had "screwed up" in a problem -- that they had
through error or connivance not followed the expository rules and
stated all necessary information. It is legitimate to do this to make
the point that in the "real world" reasonable assumptions necessary for
solving the problem are not always correct. It is then an amusing
exercise to see what assumptions may most plausibly be modified. But
it is the author who "screwed up" (or pretends to have), not the reader
who accepted a chain of reasoning and assumed that the author had
"played fair" in presenting the problem.
If this was the "point" of .0 (rather than some other error which
continues to elude me) then I am quite content to have missed it. I
would be embarrassed to be such a nit-picker as to have caught it
(and I am, as those who know me will acknowledge, a world-class
nit-picker :-)).
Topher
|
1291.6 | As I see it... | CIVAGE::LYNN | Lynn Yarbrough @WNP DTN 427-5663 | Wed Sep 05 1990 16:14 | 38 |
| My point follows, after a <ff> and suitable spaces to push a DECWindows
image off the screen:
The author's conclusion is nonsense: the birth order of children is
irrelevant to their sex, and the 2/3 probability he obtains is completely
off-the-wall. The author's analysis of cases is wrong. In general, when
the sexes of both siblings are unknown, the four relevant cases are BB, BG,
GB, GG. But when the sex of one, Myrtle, is known - and it doesn't matter
what sex Myrtle is! - the four relevant cases are BM, MB, GM, and MG, where
M is Myrtle; as reason and instinct would indicate, the odds of B vs G for
the sibling are (as roughly as you like) 50-50.
While knowing that the sibling is younger does add information, it is
irrelevant information - like knowing that it snowed the day Myrtle was
born - and has no effect on the conditional probabilities.
|
1291.7 | ...and more detailed | CIVAGE::LYNN | Lynn Yarbrough @WNP DTN 427-5663 | Wed Sep 05 1990 16:37 | 32 |
| Further explanation follows, after a <ff> and suitable spaces to push a
DECWindows image off the screen:
According to Paulos, the four relevant cases are BB, BG, GB, GG. He
correctly dismisses the BB case, but then states that the remaining cases -
BG, GB, and GG - are equally likely. That's false - it fails to take into
account that Myrtle may be either older or younger, so the GG case is
actually Gg plus gG, where g is the sister, thus twice as likely as either
of the other two, and as likely as both.
|
1291.8 | Beg to disagree... | CADSYS::COOPER | Topher Cooper | Wed Sep 05 1990 18:51 | 170 |
| RE: .6, .7
Further discussion follows, after a <ff> and suitable spaces to push a
DECWindows image off the screen:
I think that this is exactly the type of thing I was talking about in my
previous note, albeit a more subtle case of it. The "answer" we get
depends on the assumptions made about the procedure by which the expositor
came to make the statement which (s)he did. It is my opinion that the
most direct assumption for purposes of a word problem leads to the reasoning
in .0, while the reasoning in .7 would have to be directly stated to be
accepted as a "fair" problem statement.
The assumption in .0 comes from the following presumed selection procedure.
Given a population of families of four (two children), for the first
subproblem:
1) A random family is sampled.
2) If both children are boys repeat 1.
3) If only one of the children is a girl, ask her name and announce
it. If both children are girls, choose one of them at random
and announce her name.
That is, we select a random family in which at least one of the
children is a girl, and we provide that girl's (or one of the girls') name.
The process for the second subproblem is the same except that steps 2
and 3 become:
2') If the older child is a boy repeat 1.
3') Announce the name of the older child (a girl).
In this case we select a family at random in which the older child is
a girl and we provide *her* name.
Under these assumptions, the analysis in .0 is correct. If anyone is
in doubt, here is the main routine of a simulation I wrote (apologies
to BLISSaphobes):
    ROUTINE mainroutine =
        BEGIN
        LOCAL
            oldercount,
            girlcount,
            yngrbrocount,
            brocount;
        oldercount = girlcount = yngrbrocount = brocount = 0;
        randinit();
        INCR simulations FROM 1 TO number_simulations DO
            BEGIN
            LOCAL
                family; ! Encoded as two bits, high bit older sibling, low bit
                        ! the younger sibling. Bit=1 a boy, Bit=0 a girl.
            family = randfamily();
            SELECT .family OF
                SET
                [0, 1]:
                    ! An older girl
                    oldercount = .oldercount + 1;
                [0, 1, 2]:
                    ! A girl.
                    girlcount = .girlcount + 1;
                [1]:
                    ! A younger brother to a girl.
                    yngrbrocount = .yngrbrocount + 1;
                [1,2]:
                    ! A brother to a girl.
                    brocount = .brocount + 1;
                TES;
            END;
        display(.yngrbrocount, .oldercount, .brocount, .girlcount);
        SS$_NORMAL
        END;
I ran the simulation 10000 times. The result was that:
proportion of times that an older sister had a younger brother:
2494/4973 = .5015081 ~= 1/2
proportion of times that there was at least one sister that she
had a brother:
5042/7521 = .6703895 ~= 2/3
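For readers who do not read BLISS, here is a rough Python rendering of the same counting loop. It is a sketch under assumptions, not the original program: the name simulate_families, the seed parameter, and the use of Python's random module are stand-ins for randinit() and randfamily(), whose bodies are not shown in this note. The family encoding (two bits, high bit the older child, 1 = boy) is taken from the comment in the routine.

```python
import random

def simulate_families(trials, seed=0):
    """Return (P(younger brother | older girl), P(brother | at least one girl))."""
    rng = random.Random(seed)  # assumed stand-in for randinit()
    older = girl = yngr_bro = bro = 0
    for _ in range(trials):
        # 0 = GG, 1 = older girl + younger boy, 2 = older boy + younger girl, 3 = BB
        family = rng.randrange(4)  # assumed stand-in for randfamily()
        # Like BLISS SELECT, every matching case is counted, not just the first.
        if family in (0, 1):
            older += 1       # the older child is a girl
        if family in (0, 1, 2):
            girl += 1        # at least one girl
        if family == 1:
            yngr_bro += 1    # older girl with a younger brother
        if family in (1, 2):
            bro += 1         # a girl with a brother
    return yngr_bro / older, bro / girl

older_ratio, any_girl_ratio = simulate_families(10000)
print(older_ratio, any_girl_ratio)
```

The two printed ratios should land near 1/2 and 2/3 respectively, though the exact digits depend on the seed and the generator.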
I believe that this is the most natural interpretation of the problem,
and the one which is most reasonable for a reader to assume was
intended.
Other interpretations *are* possible, however, and these lead to other
results. We can for example, assume that the following process was
used for the first subproblem:
1) A random family is sampled.
2) If both of the children are named Myrtle then repeat from 1
(an unnatural naming situation).
3) If neither of the children are named Myrtle then repeat from 1.
We assume here that a child named Myrtle will be a girl.
4) Announce that one of the children is named Myrtle.
The second subproblem is similar except 3 and 4 become:
3') If the older child is not named Myrtle then repeat from 1.
4') Announce that the older child is named Myrtle.
Here the probabilities are always equal, but dependent on the
probability that a randomly selected girl will be named Myrtle. The
probability in either case of Myrtle having a brother is:
1
-------
2 - p
Where p is the probability of a girl being named Myrtle.
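The formula can be checked by brute-force enumeration rather than simulation. The sketch below is an illustration in Python and is not the program mentioned in this note; the function name and the myrtle_must_be_older flag are invented for the example. It assumes, as the procedure does, that the four sex combinations are equally likely, that boys are never named Myrtle, and that each girl is independently named Myrtle with probability p.

```python
from fractions import Fraction
from itertools import product

def brother_probability(p, myrtle_must_be_older=False):
    """P(Myrtle's sibling is a boy) given exactly one child is named Myrtle."""
    accepted = Fraction(0)
    with_brother = Fraction(0)
    for older_sex, younger_sex in product("BG", repeat=2):
        for older_is_m, younger_is_m in product((False, True), repeat=2):
            weight = Fraction(1, 4)  # each sex combination equally likely
            for sex, is_m in ((older_sex, older_is_m), (younger_sex, younger_is_m)):
                if sex == "B":
                    weight *= 0 if is_m else 1   # boys are never Myrtle
                else:
                    weight *= p if is_m else 1 - p
            if older_is_m == younger_is_m:
                continue  # zero or two Myrtles: family rejected
            if myrtle_must_be_older and not older_is_m:
                continue  # second subproblem: Myrtle must be the older child
            accepted += weight
            sibling = younger_sex if older_is_m else older_sex
            if sibling == "B":
                with_brother += weight
    return with_brother / accepted

p = Fraction(1, 2)
print(brother_probability(p), brother_probability(p, True), 1 / (2 - p))  # 2/3 2/3 2/3
```

Both subproblems agree with 1/(2 - p) for any p strictly between 0 and 1; for p = 1/3 the enumeration gives 3/5.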
When I simulated this situation (still running 10000 simulations) with
a probability of 1/2 that any particular girl would be named Myrtle
(program available upon request):
proportion of times that an older Myrtle had a younger brother:
1275/1923 = .6630265 ~= 2/3
proportion of times that a Myrtle had a brother:
2502/3743 = .6684477 ~= 2/3
Here knowing that Myrtle was the older sibling added no additional
information, but unlike the analysis in .7 the probability of a
(younger) brother is not 1/2, and cannot be 1/2.
The analysis in .7 seems to be based on the rather poorly motivated
selection procedure:
1) Select a (girl) child named Myrtle, and examine her family.
2) Announce that the family has a child named Myrtle.
The second subproblem uses:
1') Select an eldest (girl) child named Myrtle, and examine her
family.
2') Announce that the family has an eldest child named Myrtle.
I would say that this is a rather unnatural interpretation for the
problem stated in .0, although possible.
Topher
|
1291.9 | | GUESS::DERAMO | Dan D'Eramo | Wed Sep 05 1990 19:26 | 34 |
| And here is my simulation:
Lisp> (pprint-definition 'trial)
;;; The function TRIAL has a compiled definition.
(DEFUN TRIAL ()
  (LET ((K1 (SVREF '#(JOE FRED MYRTLE SUSAN) (RANDOM 4)))
        (K2 (SVREF '#(JOE FRED MYRTLE SUSAN) (RANDOM 4))))
    (IF (EQ K1 K2)
        (TRIAL)
        (IF (OR (EQ K1 'MYRTLE) (EQ K2 'MYRTLE)) (LIST K1 K2) (TRIAL)))))
Two kids are generated at random. Cases where the two
have the same name, or where neither is named Myrtle,
are rejected. Of the remaining cases, in 10,000 trials
I ended up with
(JOE MYRTLE) 1698 times
(MYRTLE FRED) 1651 times
(FRED MYRTLE) 1655 times
(MYRTLE SUSAN) 1621 times
(SUSAN MYRTLE) 1748 times
(MYRTLE JOE) 1627 times
--------------------------
10000 times
(The TRIAL function implements that each family is one
in which there is one Myrtle and that she has a sibling.)
Of the 10,000 cases, in 6,631 of them Myrtle has a
brother (66%). Of the 4899 cases in which Myrtle has
a younger sibling (i.e., in which Myrtle is older), in
3278 of them (67%) Myrtle has a brother.
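For comparison, here is a hedged Python sketch of the same experiment. It is an illustrative port, not the Lisp that was run: the names trial, tally and brother_fractions are invented here, a loop replaces the recursion, and the first name of each pair is taken to be the older child, which is how the counts above were read. The REPORTED table simply restates the six counts listed in this note.

```python
import random
from collections import Counter

NAMES = ("JOE", "FRED", "MYRTLE", "SUSAN")
BOYS = {"JOE", "FRED"}

def trial(rng):
    """Draw two names until they differ and one of them is MYRTLE."""
    while True:
        k1, k2 = rng.choice(NAMES), rng.choice(NAMES)
        if k1 != k2 and "MYRTLE" in (k1, k2):
            return k1, k2

def tally(trials, seed=0):
    rng = random.Random(seed)
    return Counter(trial(rng) for _ in range(trials))

def brother_fractions(counts):
    """Return (P(brother), P(brother | Myrtle listed first, i.e. older))."""
    total = sum(counts.values())
    with_brother = sum(n for pair, n in counts.items() if BOYS & set(pair))
    older = sum(n for (k1, _), n in counts.items() if k1 == "MYRTLE")
    older_with_brother = sum(
        n for (k1, k2), n in counts.items() if k1 == "MYRTLE" and k2 in BOYS
    )
    return with_brother / total, older_with_brother / older

# The six counts reported in this note.
REPORTED = Counter({
    ("JOE", "MYRTLE"): 1698, ("MYRTLE", "FRED"): 1651, ("FRED", "MYRTLE"): 1655,
    ("MYRTLE", "SUSAN"): 1621, ("SUSAN", "MYRTLE"): 1748, ("MYRTLE", "JOE"): 1627,
})
print(brother_fractions(REPORTED))
print(brother_fractions(tally(10000)))
```

Applied to the reported counts, the function reproduces 6631/10000 and 3278/4899; a fresh run should give two values near 2/3, varying with the seed.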
Dan
|
1291.10 | | CADSYS::COOPER | Topher Cooper | Thu Sep 06 1990 13:38 | 18 |
| RE: .9 (Dan)
For those not following real close, Dan's LISP simulation is a
simulation of a situation analytically equivalent to my second pair
of assumptions about how sampling is done. If he expanded the number
of names to six (three boys' and three girls' names) he would get
results of around 60% for both cases.
Dan's outer IF expression could be rewritten as:
(IF (XOR (EQ K1 'MYRTLE) (EQ K2 'MYRTLE))
(LIST K1 K2)
(TRIAL))
... if Common LISP included an XOR predicate, which it does not -- an
odd lack in "The language for those who want everything.".
Topher
|
1291.11 | determining what's routine is not routine :-) | EAGLE1::BEST | R D Best, sys arch, I/O | Thu Sep 06 1990 16:46 | 50 |
| re .5
>RE: .3 (Dan)
>
> Whether this could be considered an error or not is a matter of
> philosophy. Given any finite description of a problem we must make
> certain assumptions in order to solve that problem. If we don't
> make assumptions then: why assume that Myrtle is female rather than
> a male with a weird name? Why assume that Myrtle is a member of the
> family at all? Why assume that the probability of a girl vs a boy
> is reasonably approximated by 1/2 -- maybe this takes place in a small
> rural Chinese village where girl-babies are frequently killed at birth?
> Why assume that the informant is telling the truth -- maybe it's really
> a family of 12? Maybe the informant is speaking in code?
>
> Word problems differ from first hand real world problems in one
> important way. In a first hand real-world problem we would deal with
> these issues probabilistically, and, if we feel that alternatives have
> slightly too high a probability then we could conditionalize our answer
> by our assumptions. In a word problem there is a convention that
> routine assumptions which are required for the solution of the problem
> are correct unless contradicted -- directly or subtly (in which case it
> is a "trick" question) -- by information provided. This is doubly true
> in problems provided for expository purpose rather than for puzzles.
I agree with your statements about having to make some assumptions.
However, I often have trouble determining what the 'routine assumptions' are.
I have noticed an interesting (and annoying) psychological phenomenon.
As I've grown older and more experienced(?), I find I have to take more
time solving word problems. At first, I wondered whether I might be
dimming intellectually (still a possibility :-), but have since come to the
conclusion that the difficulty principally arises from being able to produce
more problem interpretations and self-consistent assumption sets than I
would have been able to in the past.
I now more frequently conclude that problem statements are 'underspecified'.
Some related comments on proofs:
When reviewing proofs, I find that I have become highly
sensitive to unstated assumptions, and frequently find that the proofs are
not as convincing as they used to be. When I try proving things myself,
I frequently wind up with a set of relatively low level subproofs that
can prove maddeningly difficult to crack, because they consist of
things I previously thought of as 'immediately obvious to the most casual
observer', but now feel the need to justify. I got a good dose of this
in independently working on MATH 313 (the integer rectangle problem).
(Does anyone else experience this ??)
|
1291.12 | You look good in a uniform - distribution | CIVAGE::LYNN | Lynn Yarbrough @WNP DTN 427-5663 | Thu Sep 06 1990 17:26 | 52 |
| > The assumption in .0 comes from the following presumed selection procedure.
> Given a population of families of four (two children), for the first
> subproblem:
> 1) A random family is sampled.
> 2) If both children are boys repeat 1.
> 3) If only one of the children is a girl, ask her name and announce
> it. If both children are girls, choose one of them at random
> and announce her name.
> That is, we select a random family in which at least one of the
> children is a girl, and we provide that girl's (or one of the girls') name.
While that is a random way of choosing a girl for the 'experiment', it does
not appear to me to be based on a uniform distribution. Assuming 'Myrtle'
is available, in the cases BG and GB she is 100% likely to be the girl of
interest, but in the GG case only 50%; the process is biased toward
choosing BG siblings.
If we select one child from a pair from a uniform distribution and find
that it is a girl, the odds are 2-1 that the choice was from GG as opposed
to either BG combination; given that a girl was selected the complete set
of probabilities is:
BB 0
BG .25
GB .25
GG .50
and again in half the cases the sibling is a boy, in half a girl.
To my mind a better algorithm to model the situation is:
Choose a pair p randomly from <00,01,10,11>. Choose a member m from <0,1>.
Examine the mth bit of p. If it's a 1(boy), reject the case; that can't be
Myrtle. If it's a 0, it's Myrtle, so increment the count of girls if the
other bit is 0, else increment the count of boys.
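The algorithm just described has only eight equally likely (pair, member) outcomes, so it can be enumerated exactly instead of sampled. The sketch below is one possible Python reading of it; the function name is invented, and the bit convention (1 = boy, 0 = girl, member m selects bit m) follows the description above.

```python
from fractions import Fraction
from itertools import product

def sibling_odds_when_child_sampled():
    """Return (P(sibling is a boy), P(sibling is a girl)) given the chosen child is a girl."""
    boys = girls = Fraction(0)
    for pair, m in product(range(4), range(2)):  # 8 equally likely cases
        chosen = (pair >> m) & 1        # the mth bit of the pair
        other = (pair >> (1 - m)) & 1   # the remaining bit
        if chosen == 1:
            continue                    # a boy: that can't be Myrtle
        if other == 0:
            girls += Fraction(1, 8)
        else:
            boys += Fraction(1, 8)
    accepted = boys + girls
    return boys / accepted, girls / accepted

print(sibling_odds_when_child_sampled())
```

Four of the eight cases survive the rejection step, two with a brother and two with a sister, so both returned values are 1/2.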
If we restate the problem in purely non-sexist terms it becomes a bit
clearer, I think. (My name is unisex, so I will use it in what follows.)
Consider a random pair of siblings. One of them happens to be named Lynn.
What are the chances that Lynn's sibling is of the opposite sex? Under the
assumption that the sexes are equally common and uniformly distributed, it
is very hard to come up with any answer other than 50-50.
To model this case, the algorithm would be:
Choose a pair p randomly from <00,01,10,11>. If the bits are the same,
increment the count of like siblings, else increment the count of different
siblings.
Even if males and females are not equally common, the answer is very close
to 50-50. Assume that boys occur 40% of the time, girls 60%. Then BB+GG
will occur (.4*.4+.6*.6)=.52 of the time, and (BG+GB) will occur
(.4*.6+.6*.4)=.48. The sum of the (BB+GG) frequencies tends to be the same
as the sum of the (BG+GB) frequencies.
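The arithmetic for unequal sex ratios is easy to tabulate. The following Python sketch is illustrative only; like_unlike and boy_rate are names made up for the example, and it assumes the two children's sexes are independent.

```python
from fractions import Fraction

def like_unlike(boy_rate):
    """Return (P(BB or GG), P(BG or GB)) for independent births."""
    girl_rate = 1 - boy_rate
    like = boy_rate ** 2 + girl_rate ** 2
    unlike = 2 * boy_rate * girl_rate
    return like, unlike

like, unlike = like_unlike(Fraction(2, 5))   # boys 40%, girls 60%
print(float(like), float(unlike))            # 0.52 0.48
```

The gap between the two is (2b - 1)^2 for a boy rate b, so like pairs are never less common than unlike pairs, and the split is exactly even only at b = 1/2.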
|
1291.13 | Uniform distributions -- multiform problems. | CADSYS::COOPER | Topher Cooper | Thu Sep 06 1990 18:49 | 38 |
| RE: .12 (Lynn)
Lynn, you are absolutely correct, but you are solving a different
problem than the one stated -- unless you stretch the assumptions about
interpretation quite far. If you uniformly select a girl (or a girl
named Myrtle) and ask what the a posteriori probability is that (assuming
the girl has a single sibling) Myrtle's family is a gg family, then
the answer is, without question, 1/2. Why? Because gg families are
twice as likely to be sampled as bg and gb families -- the *families*
are not being sampled uniformly.
But the problem statement is:
Consider now some randomly selected family of four.
Using the "default" definition of "randomly selected" (i.e., uniformly
sampled) this implies strongly that the family *is* selected uniformly
rather than the girl. I do not consider this a legitimate
interpretation of the problem: the propounder was not playing fair if
this was the intention. I would accept as perhaps a legitimate
difference in "default reasoning" a different interpretation of how
the actor came to announce that the family has a child named Myrtle,
but *not* shifting the "random selection" to uniform selection of a
child.
For this to be a reasonable statement, we would have had to have a
statement to the effect:
Consider now a randomly selected child with a single sibling.
The analysis in .0 is an accurate analysis of a reasonable
interpretation of the problem statement (and I think the most reasonable
interpretation). The analysis in .12 is an accurate analysis of another
problem or of a rather unreasonable interpretation of the same problem
-- by no stretch of the imagination can it be taken as THE uniquely
correct interpretation/analysis of the problem stated in .0.
Topher
|
1291.14 | What? There is still something left to say? | CSSE::NEILSEN | I used to be PULSAR::WALLY | Fri Sep 28 1990 13:48 | 58 |
| I've been puzzling over the example in .0 for six months, off and on. The one
thing I am sure of is that it is not a good example of mathematical illiteracy,
since literate people can give different answers.
My thanks to Lynn and Topher, because their comments have helped me to clarify
my thinking greatly. What follows is to some extent just a clarification of
their ideas, although I reach a different conclusion.
Look carefully at the first two sentences in Paulos' statement of the problem.
>"Consider now some randomly selected family of four. Given that Myrtle has
>a sibling, what is the conditional probability that her sibling is a
>brother?
Ignoring the many enticing bypaths suggested by Dan in .4, it is clear that
there is something missing between these two sentences. In a randomly selected
family of four, the probability is very small that there will be a child named
Myrtle. There is only a 75% probability that there will be a girl child at
all. The question is what is missing.
Topher suggests that the first sentence should be "Consider a randomly selected
family of four with at least one girl child", and Paulos seems to have had
something like this in mind, judging by his analysis. A sentence must also be
inserted, something like "Announce the name of a girl child, chosen randomly
from the set of girl children in the family."
There is another quite reasonable insertion. Add the second sentence: "Select
a child at random and announce its name." This is close to, but I think not
identical to, Lynn's suggestion in .12. This gives 50% as the answer to both
questions, as the following table shows:
Case   Pair of    Child      Sex of     Sex of
 #     Children   Selected   Selected   Other
                             Child      Child
 1     BB         1          B          B
 2     BB         2          B          B
 3     BG         1          B          G
 4     BG         2          G          B
 5     GB         1          G          B
 6     GB         2          B          G
 7     GG         1          G          G
 8     GG         2          G          G
All cases are equally probable, assuming that the selection is random and
independent. Cases 4,5,7 & 8 apply to Paulos' first question; there are two B
and two G cases for the other child. Cases 4 & 8 apply to Paulos' second
question; there is a B case and a G case. So the answer is 50% in both cases.
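The table and the case counts can be regenerated mechanically. This Python sketch is only an illustration; ROWS and brother_fraction are invented names, and the second question is read as "the selected girl is child 2", which is how cases 4 and 8 are picked out above.

```python
from fractions import Fraction
from itertools import product

# Eight equally likely rows: (case number, pair, child selected,
# sex of selected child, sex of other child).
ROWS = [
    (number, a + b, index, (a, b)[index - 1], (a, b)[2 - index])
    for number, ((a, b), index) in enumerate(
        product(product("BG", repeat=2), (1, 2)), start=1
    )
]

def brother_fraction(case_numbers):
    """Share of the given cases in which the other child is a boy."""
    picked = [row for row in ROWS if row[0] in case_numbers]
    return Fraction(sum(row[4] == "B" for row in picked), len(picked))

# First question: the selected child is a girl.
girl_selected = [row[0] for row in ROWS if row[3] == "G"]
# Second question: the selected girl is child 2.
girl_selected_as_child_2 = [row[0] for row in ROWS if row[3] == "G" and row[2] == 2]

print(girl_selected, brother_fraction(girl_selected))                        # [4, 5, 7, 8] 1/2
print(girl_selected_as_child_2, brother_fraction(girl_selected_as_child_2))  # [4, 8] 1/2
```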
Both ways of interpreting the text seem to me acceptable, although there are a
few reasons for preferring the second. The first interpretation requires
changing the meaning of the first sentence. Under it the family of four is
not randomly selected; it is selected according to an implicit condition. The
second interpretation seems to make less of a change to the actual words of the
problem.
I guess this is another illustration of how hard it is to write unambiguously.
|