
Conference rusure::math

Title:Mathematics at DEC
Moderator:RUSURE::EDP
Created:Mon Feb 03 1986
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:2083
Total number of notes:14613

1394.0. "Elementary (?) Statistics" by SMEGIT::ARNOLD (Some assembly required) Tue Mar 05 1991 11:00

    I've gone thru this conference & looked at the notes that deal
    with statistics, but I don't think I've found what I'm looking for.
    OK, statistics gurus, if any of you can offer some insight here,
    it would certainly be appreciated.  I'm taking a home-study course
    in statistics, and I think I'm having some problems in understanding
    some of the basics.  Let me lay out a test problem to explain:

    In an experiment, there are twelve possible resulting values
    that can occur after an operation.  For the sake of simplicity,
    we'll call those values "A" thru "L".  The experiment MUST
    produce one and only one of those values after each operation.
    The resulting value is supposedly truly "random".  (Please, no
    ratholes about whether "truly random" really exists or not, just
    assume it does.)  You have the results of the last 'x' operations,
    where 'x' is at least 100 but less than 5000.  (Does this even
    matter, or is this statement a ringer?)

    Now I understand that statistically speaking, each of the 12
    possible results should have occurred one-twelfth of the time;
    or if X is the number of historical operations we have on file,
    then each value's statistically accurate appearance could be
    computed by (X * .0833).  Right?  

    What else can be said about this at this point in time, assuming
    we have X operation results on file?  The text makes reference to
    "inferential statistics" a few times, but nowhere does it explain
    exactly what that is?  Is there a way to "accurately predict"
    (given some "percentage of assurance") what the next result should
    be?  It is clearly (?) more than just determining which result
    has occurred the least to date?  The text makes reference to
    taking the standard deviations, then grouping, allowing us to
    determine 67% or 98% sureness, but I don't understand that, nor
    why/how that works.

    Any insights would be appreciated.  BTW, if it matters, the text
    being used is "Statistics Without Tears" by Rowntree.

    Thanks
    Jon
1394.1. by ELIS02::GARSON (V+F = E+2) Tue Mar 05 1991 11:51
re .0

>    OK, statistics gurus, if any of you can offer some insight here,
>    it would certainly be appreciated.
    
    I'm no statistics guru but I did study some a little over a decade ago.
    
>    The resulting value is supposedly truly "random".  (Please, no
>    ratholes about whether "truly random" really exists or not, just
>    assume it does.)
    
    As I remember it, a random variable has two properties
    
    a) short term unpredictability
    b) long term predictability
    
>    Now I understand that statistically speaking, each of the 12
>    possible results should have occurred one-twelfth of the time;
>    or if X is the number of historical operations we have on file,
>    then each value's statistically accurate appearance could be
>    computed by (X * .0833).  Right?  

    From your statement it would appear that your random variable is
    assumed to be uniformly distributed. "Random" and "uniform" are not the
    same thing.
    
    Personally I feel that you would be better off starting with something
    simpler than a random variable with 12 possible values. Why not start
    with the work horse of introductory statistics - the unbiased coin?
    
    Imagine you toss this coin 100 times counting heads or tails. Would you
    expect exactly 50 heads and 50 tails? No...not in the sense that
    failure to achieve this is proof of a biased coin. (After all what
    happens after 101 tosses? You can't have 50½ heads.) For any given
    outcome (number of heads) we can calculate the probability of the
    outcome assuming an unbiased coin and use that as evidence for or
    against the coin being unbiased.
    
    There is a mathematical quantity called the "expectation" which would
    indeed be 50 heads. The expectation is just the number of heads in an
    outcome times the probability of the outcome summed over all possible
    outcomes. But don't expect the expectation (if you see what I mean).
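
    A tiny illustration of that last remark (Python here purely for the
    arithmetic): the expectation for 100 tosses of an unbiased coin is 50 heads,
    and 50 is also the single most likely count, yet it turns up only about 8%
    of the time.

        from math import comb

        n = 100
        # probability of getting exactly 50 heads in 100 tosses of an unbiased coin
        p_exactly_50 = comb(n, 50) * 0.5 ** n
        print(p_exactly_50)                               # about 0.08

        # expectation = sum over outcomes of (number of heads * probability of that outcome)
        expectation = sum(k * comb(n, k) * 0.5 ** n for k in range(n + 1))
        print(expectation)                                # 50.0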
    
    >Is there a way to "accurately predict" (given some "percentage of
    >assurance") what the next result should be?
    
    A random variable exhibits short term unpredictability. It is not
    possible to predict the next result regardless of what information you
    have collected.
    
    >It is clearly (?) more than just determining which result has occurred the
    >least to date?
    
    Correct. I believe the idea that the result which has occurred least to
    date is the most probable next result is called the Monte Carlo
    fallacy.
    
    Hoping this is of some use and awaiting the real statistics gurus...
1394.2. "some more stat help" by CSSE::NEILSEN (Wally Neilsen-Steinhardt) Wed Mar 06 1991 16:51
>    assume it does.)  You have the results of the last 'x' operations,
>    where 'x' is at least 100 but less than 5000.  (Does this even
>    matter, or is this statement a ringer?)

For many statisticians, there is a lower bound of sample size, and below that
bound you don't do statistics, because there is "insufficient data".  There
is also an upper bound where simplifying assumptions make most techniques
unnecessary.  The numbers above say that this is a problem designed for 
such statisticians: there is "enough data" but not "too much data."

>    Now I understand that statistically speaking, each of the 12
>    possible results should have occurred one-twelfth of the time;
>    or if X is the number of historical operations we have on file,
>    then each value's statistically accurate appearance could be
>    computed by (X * .0833).  Right?  

As .1 says, random is not necessarily uniform.  The common assumption in 
this type of problem is that your experiment produces random but not 
necessarily uniform results.

>    What else can be said about this at this point in time, assuming
>    we have X operation results on file?  The text makes reference to
>    "inferential statistics" a few times, but nowhere does it explain

This is sounding to me like a traditional class, so I will give the traditional 
approach.

First you compute some descriptive statistics, like this.  Define XA as the 
number of times result A appears, XB similarly and so forth.  Then define a
histogram of the X's.  Then compute FA = XA/X, FB = XB/X and so forth.  Note
that the F's are all bounded by zero and one, and must sum to one.  These F's
summarize the X data, and by the assumption of randomness, summarize it
completely.

These F's are also an approximation to the underlying probabilities of the
experiment.  Call PA the probability of getting result A in a given test, then
FA is an approximation to PA.
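
If it helps to see the descriptive step concretely, here is a small sketch (Python used
purely for illustration; the generated results list is a hypothetical stand-in for the X
operations on file):

    import random
    from collections import Counter

    outcomes = "ABCDEFGHIJKL"
    results = [random.choice(outcomes) for _ in range(500)]   # stand-in for the X results on file

    X = len(results)
    counts = Counter(results)                    # XA, XB, ... : the histogram
    F = {o: counts[o] / X for o in outcomes}     # FA, FB, ... : each between zero and one

    print(F)
    print(sum(F.values()))                       # sums to one, as it must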

>    exactly what that is?  Is there a way to "accurately predict"
>    (given some "percentage of assurance") what the next result should
>    be?  It is clearly (?) more than just determining which result
>    has occurred the least to date?  The text makes reference to

Some statisticians, like me, would allow you to use this approximation to say 
the probability that the next result will be A is PA, approximately FA.  Note 
that this says the result occurring most to date is most likely.  Note also 
that to do this right, you need to analyze the approximation, which is too
complex for this reply.  Note finally that many statisticians consider this a
completely illegitimate statement.

To apply inferential statistics, you have to decide what inference (or 
hypothesis) you want to test, and then pick the appropriate statistic to test 
it.

A typical hypothesis to introduce here is that all the PAs are equal.  Then one
asks whether the FAs are such as to rule out that hypothesis, to some degree
of certainty.  The most common statistic to test this is the chi-square,
computed by the sum of 

	( FA - 0.0833 ) ^2 / FA

over A, B, and so forth.  You also need the "degrees of freedom", which for 
this type of problem is one less than the number of categories, or 11, and your
desired level of confidence, for example, 0.01.  You look in the back of the 
book for a table of chi-square, and compare the entry there to the number you
computed above.  If the number you computed is larger, then you can rule out
the hypothesis that the PAs are equal.
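
For reference, the usual textbook form of the chi-square statistic works with counts
rather than fractions: the sum over the categories of (observed - expected)^2 / expected.
A short Python sketch (purely illustrative; the counts are hypothetical):

    # hypothetical counts XA..XL for X = 500 operations
    observed = [45, 38, 41, 52, 39, 47, 36, 44, 40, 43, 35, 40]

    X = sum(observed)
    expected = X / 12.0                # what "all the PAs are equal" predicts for each category

    chi_square = sum((o - expected) ** 2 / expected for o in observed)
    degrees_of_freedom = 12 - 1        # one less than the number of categories

    print(chi_square)                  # compare with the tabled chi-square value for 11 degrees
                                       # of freedom at the chosen level (about 24.7 at 0.01)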

>    taking the standard deviations, then grouping, allowing us to
>    determine 67% or 98% sureness, but I don't understand that, nor
>    why/how that works.

This is good inferential statistics, but for another problem.  It does not 
apply here unless your course is being very sloppy.

1394.3. "From some further experimentation..." by TIGEMS::ARNOLD (Some assembly required) Fri Mar 08 1991 12:51
    OK, here's what I did to shed additional light on this.  (Whereas
    in fact, it actually made the whole issue muddier for me...)

    I wrote a quickie program to generate 5000 variables, where the
    value of the random variable must be a whole number between 1
    and 12 inclusive.  (VAX BASIC RANDOMIZE and RND functions -- does
    that mean *really* random?  Does that mean "uniform"?)

    I then went thru random samplings of the data, taking anywhere
    from the last 200 sampling from any one point in time to about
    1200 previous samplings from any one point in time.  As .2
    suggested, I applied the formula there (FA - 0.0833)^2 / FA)
    for FA, FB, FC, etc.  What did that tell me?  The values
    computed for each of the 12 variables was not equal, really
    not even looking close.  Was I to be looking for something
    specific when doing this?  Did this actually tell me something
    that I'm just not seeing?

    By picking a point in time "X" and being able to look back
    "Y" samplings, but already knowing the result of event "X+1",
    I'm trying to determine what I would be looking for in the
    data that would have led me to believe that the actual result
    achieved in sample "X+1" was what I would have predicted based
    on that data.

    Did I miss something, is this more statistical than I can hope
    to understand, or am I mis-applying statistical rules?

    Thanks for any & all help
    Jon
1394.4. "some analysis of what you have done" by CSSE::NEILSEN (Wally Neilsen-Steinhardt) Fri Mar 08 1991 13:15
>    I wrote a quickie program to generate 5000 variables, where the
>    value of the random variable must be a whole number between 1
>    and 12 inclusive.  (VAX BASIC RANDOMIZE and RND functions -- does
>    that mean *really* random?  Does that mean "uniform"?)

Assuming that you did something like N = FIX(1 + 12*RND), then that is nearly
uniform.  Technically the series you generated is pseudo-random, but last I
heard it would take a roomful of VAX-9000s and a few people-years of analysis
to detect the flaws in the current Random function in the RTL.  So don't worry
about it, and assume you have a uniform, random distribution.

>    I then went thru random samplings of the data, taking anywhere
>    from the last 200 sampling from any one point in time to about
>    1200 previous samplings from any one point in time.  As .2
>    suggested, I applied the formula there (FA - 0.0833)^2 / FA)
>    for FA, FB, FC, etc.  What did that tell me?  The values

Nothing yet.

>    computed for each of the 12 variables was not equal, really
>    not even looking close.  Was I to be looking for something
>    specific when doing this?  Did this actually tell me something
>    that I'm just not seeing?

You need to add the twelve values you get, then compare them to the chi-square 
value.  Call the sum of the twelve values CHI-SQUARE.  I predict that nine times
out of ten, CHI-SQUARE is less than 5.  This will confirm that RND is pretty
good at producing a uniform distribution.

Oops, I saw an error in the formula above.  It should be

	SUM of (FA - 0.0833...)^2 / 0.0833...
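
One detail worth making explicit: a chi-square table expects the statistic computed from
counts, which is N times the frequency-based sum above (equivalently, the sum over the
twelve bins of (observed - expected)^2 / expected).  A quick Python sketch of the whole
check, standing in for the BASIC program:

    import random

    N = 5000
    counts = [0] * 12
    for _ in range(N):
        counts[random.randint(1, 12) - 1] += 1    # plays the role of FIX(1 + 12*RND)

    F = [c / N for c in counts]

    freq_sum = sum((f - 1 / 12) ** 2 / (1 / 12) for f in F)   # the corrected sum above
    chi_square = N * freq_sum                                 # the value a chi-square table expects

    print(freq_sum, chi_square)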

>    By picking a point in time "X" and being able to look back
>    "Y" samplings, but already knowing the result of event "X+1",
>    I'm trying to determine what I would be looking for in the
>    data that would have led me to believe that the actual result
>    achieved in sample "X+1" was what I would have predicted based
>    on that data.

Once you know that this is a uniform, random distribution, there is no more 
information to be obtained from any past sequence.  In principle, you could
do much more careful analyses of the sequences to detect and exploit the 
deviations of RND from a true random number, but then we are back to the
roomful of VAX 9000s.

1394.5. "Corrections." by CADSYS::COOPER (Topher Cooper) Fri Mar 08 1991 14:14
RE: .4 (Wally Neilsen-Steinhardt)

>>    and 12 inclusive.  (VAX BASIC RANDOMIZE and RND functions -- does
>>    that mean *really* random?  Does that mean "uniform"?)
>
>Assuming that you did something like N = FIX(1 + 12*RND), then that is nearly
>uniform.  Technically the series you generated is pseudo-random, but last I
>heard it would take a roomful of VAX-9000s and a few people-years of analysis
>to detect the flaws in the current Random function in the RTL.  So don't worry
>about it, and assume you have a uniform, random distribution.

    A bit of a nit.  Unless the BASIC RTL uses something other than the
    standard VMS math RTL or the "current" RTL is different from the one we
    have on our system, you have overstated the case by quite a bit.
    The RTL uses a LCG, and it is pretty easy to detect the flaws if you
    know how.  If you plot the positions of N-tuples in N-space all the
    points fall in nice neat hyperplanes.  For small N, and properly chosen
    parameters, the planes are very close together and so constitute a
    minor deviation in the low-order bits.  For even moderate N, however
    (say 10) the effect is unavoidably quite severe.

    Your conclusion, however, is correct.  For low-dimensionality (in this
    case N=1) and relatively small sample size (i.e., 5000), this generator
    is quite sufficient.  Care should be taken in using it for more
    complicated simulations, however.


>You need to add the twelve values you get, then compare them to the chi-square 
>value.  Call the sum of the twelve values CHI-SQUARE.  I predict that nine times
>out of ten, CHI-SQUARE is less than 5.  This will confirm that RND is pretty
>good at producing a uniform distribution.

    'Fraid you'll lose, Wally, *unless* the independence assumption of the
    chi² test is violated rather badly.  You forgot that with 12
    frequencies we're talking about chi² with 11 degrees of freedom.  (For
    the uninitiated, chi² is not a single distribution but a family of
    distributions each one distinguished by a parameter called degrees of
    freedom (df).  For problems of this specific kind, the member of the
    family to use is the one whose df is one less than the number of "bins"
    you are counting things in.  Relatively minor variants of the problem
    will result in different rules for determining the correct df to use).
    The mean of a chi² distribution is equal to the number of degrees of
    freedom (in this case 11.0), so it would be rather surprising if 90% of
    the time the chi square value would be less than half the mean.

    Checking a chart, I find that the result should be less than 5 between
    1 time in 20 and 1 time in 10 (.05 level is 4.57; .1 level is 5.58).  It
    should be less than 17.3 nine times out of ten, and less than 19.7 19
    times out of 20.
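
    These chart values are easy to reproduce if a statistics package is handy;
    for instance, a quick sketch assuming SciPy is available:

        from scipy.stats import chi2     # assumes SciPy is installed

        df = 11
        print(chi2.mean(df))                          # 11.0, the mean mentioned above
        for p in (0.05, 0.10, 0.90, 0.95):
            print(p, round(chi2.ppf(p, df), 2))       # 4.57, 5.58, 17.28, 19.68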

					    Topher
1394.6. "spectral test" by GUESS::DERAMO (Dan D'Eramo) Fri Mar 08 1991 17:19
        re .5,
        
>>		 If you plot the positions of N-tuples in N-space all the
>>    points fall in nice neat hyperplanes.
        
        I've always wondered whether for these N-tuples they
        intend
        
        	(x1,x2,...,xN), (x2,x3,...,xN+1), (x3,x4,...,xN+2),...
        
        or
        
        	(x1,...,xN),(xN+1,...x2N),(x2N+1,...,x3N),...
        
        Dan
1394.7. "spectral test" by ALLVAX::JROTH (I know he moves along the piers) Fri Mar 08 1991 22:56
    Re .6

    They mean the former (cyclic shift thru the sequence) but in
    practice it really doesn't matter...

    - Jim
1394.8. "Both or neither or one or the other." by CADSYS::COOPER (Topher Cooper) Mon Mar 11 1991 12:41
RE: .6 (Dan)

    They don't really mean either one, although the first comes closer to
    the truth, the second comes closer to practical application.

    Any n-tuple generated by a linear congruential generator will occur on
    one of a limited number of hyperplanes.  The location of the hyperplane
    does not depend on the "phase" of the n-tuple relative to any
    particular, arbitrary starting point in the PRNG's cycle (for that
    matter, for n relatively prime to the PRNG's cycle length, any n-tuple may be
    considered to be at all phases).

    If you want to "see" the hyperplanes by some plotting technique, you
    can choose overlapping or non-overlapping n-tuples, or choose a random
    x1 by an appropriate independent means and use that and the next n-1
    values for the point.

    In considering how this flaw might influence actual applications, you
    would normally be worried about generating n values for one simulation
    "trial" and then generating n more for the next, and therefore the
    sequential non-overlapping case would be what you would actually be
    worrying about.

    The spectral test operates algebraically on all the n-tuples (and
    therefore on overlapping ones) to characterize how much space there is
    between the planes.

					Topher
1394.9. "a nit for a nit" by CSSE::NEILSEN (Wally Neilsen-Steinhardt) Mon Mar 11 1991 13:03
.5>    A bit of a nit.  Unless the BASIC RTL uses something other than the
>    standard VMS math RTL or the "current" RTL is different from the one we
>    have on our system, you have overstated the case by quite a bit.
>    The RTL uses a LCG, and it is pretty easy to detect the flaws if you
>    know how.  If you plot the positions of N-tuples in N-space all the
>    points fall in nice neat hyperplanes.  For small N, and properly chosen
>    parameters, the planes are very close together and so constitute a
>    minor deviation in the low-order bits.  For even moderate N, however
>    (say 10) the effect is unavoidably quite severe.

Yes, I knew about this.  In fact, I once plotted on a GT40 the PDP-11 random
number generator output using N=3.  It fell into about 5 planes.  The VMS 
RTL people were aware of this problem, and chose a much different multiplier,
69069.  I never checked, but I think it avoided this problem, in the sense that
the number of planes for N=3 became very large.

The nit here is that you don't get all the information: the floating point 
values have been scaled and truncated into the 12 bins.  If N=10, you will have
12^10 (about 62 billion) bins to fill.  

I believe there are ways of determining the seed from the data, but the only 
one I can think of is brute force: generate the entire 2^32 sequence of 
integers, and then pattern match your sequence against it.  I think that a 
sequence of about ten integers is likely to be unique.
1394.10. "you got me on this one" by CSSE::NEILSEN (Wally Neilsen-Steinhardt) Mon Mar 11 1991 13:08
.5>    'Fraid you'll lose, Wally, *unless* the independence assumption of the
>    chi² test is violated rather badly.  You forgot that with 12
>    frequencies we're talking about chi² with 11 degrees of freedom.  (For
>    the uninitiated, chi² is not a single distribution but a family of

No, I remembered about the 11 degrees of freedom, but misread the column
headers on my table.  I should have said 18 and not 5.
1394.11. "Nits cubed." by CADSYS::COOPER (Topher Cooper) Mon Mar 11 1991 15:36
RE: .9 (Wally)

    Yes, the multiplier they used (69069) is quite good for small N.  But
    one only checks LCG parameters for small N -- up to 6 or 8.  The
    behavior gets worse pretty quickly with increasing N.  As I remember
    (don't ask why I should remember this), the upper-bound for the number
    of hyperplanes for an LCG is:

                (N! * M)^(1/N)

    with N being the dimensionality (number of elements in the n-tuple) and
    M being the modulus used (which for the VMS PRNG is 2^31).  When N is
    3, the upper bound comes out to 2344 (non-hyper)planes, and the
    particular choice of multiplier does a good job of near-optimality.
    For N=10, however, the best that any multiplier could accomplish would
    be about 38 planes, and no one bothered to see how well the multiplier
    actually approached that limit.
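
    Evaluating that bound is a one-liner if you want to check the figures (a
    throwaway Python sketch; M is the modulus value quoted above):

        import math

        def max_hyperplanes(n, m=2 ** 31):        # m = the modulus of the LCG
            return (math.factorial(n) * m) ** (1.0 / n)

        print(max_hyperplanes(3))                 # about 2344
        print(max_hyperplanes(10))                # about 38.8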

    Yes, if you are talking about calculations after you have reduced the
    number to an integer between 1 and 12, then if the multiplier happens
    to be nearly optimum for this dimensionality, the effect will be
    slight (since each "bin" will contain multiple 9-planes), but I was
    talking about weaknesses in the underlying routine (as I thought you
    were also).  (And, of course, we have virtually no assurance that this
    multiplier comes anywhere close to optimality for N=10).

>I believe there are ways of determining the seed from the data, but the only 
>one I can think of is brute force: generate the entire 2^32 sequence of 
>integers, and then pattern match your sequence against it.  I think that a 
>sequence of about ten integers is likely to be unique.

    So far I have limited myself to reasonable statistical flaws, i.e.,
    flaws that a reasonable statistical test might reveal without any
    special knowledge of the underlying algorithm.  You bring up algebraic
    "weakness", however.  People interested in cryptography have developed
    quite efficient techniques for determining a seed for an LCG given a
    few adjacent numbers derived from the high-order bits.  Theoretically,
    10 values in the range 0-11 should be enough or almost enough to
    determine the seed if the multiplier, modulus and additive factor are
    known.  In practice a few more might be needed.

    Basically, each new 0-11 value gives you lg(12) bits of information
    (where lg is log-base-2), or about 3.6 bits of information.  Nine
    values would give us 32.4 bits of information which is slightly more
    than enough (since we need 31 bits).  As it happens, though, a
    particular sequence of values may overdetermine parts of the answer,
    and therefore underdetermine others.  An additional value or two may
    be necessary to disambiguate.

    So 10 values are probably unique, but may not be.  Nine values may be
    unique.

					    Topher
1394.12. "not so good" by RDVAX::GRIES Tue Mar 12 1991 16:12
!    Yes, the multiplier they used (69069) is quite good for small N. 

The current RTL RNG has a rather large failing.  If one looks at the sequence of
numbers generated, bit position by bit position, the bits cycle with periods
corresponding to their position.  Bit 0 cycles 0 1 0 1 0 1 0 ...
Bit 1 cycles every four numbers.
Bit 24 cycles every 2**25 numbers.
Bit n cycles every 2**(n+1) numbers.
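
A sketch that makes the cycling visible (Python purely for illustration; the multiplier
and power-of-two modulus are the ones discussed in this topic, while the additive
constant of 1 is an assumption):

    M = 2 ** 32
    A, C = 69069, 1          # multiplier from this thread; additive constant assumed

    def lcg(seed=1):
        while True:
            seed = (A * seed + C) % M
            yield seed

    print([x & 1 for x, _ in zip(lcg(), range(8))])        # bit 0: 0 1 0 1 0 1 0 1

    # bit n cycles every 2**(n+1) outputs (the claim above)
    for n in range(6):
        bits = [(x >> n) & 1 for x, _ in zip(lcg(), range(2 ** (n + 2)))]
        assert bits[:2 ** (n + 1)] == bits[2 ** (n + 1):]
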
1394.13. "Nit ** N'th" by ALLVAX::JROTH (I know he moves along the piers) Tue Mar 12 1991 19:05
>    with N being the dimensionality (number of elements in the n-tuple) and
>    M being the modulus used (which for the VMS PRNG is 2^31).  When N is
                                                         ^^^^
    2^32...

    The number 69069 was chosen by Marsaglia after an exhaustive
    search using the spectral test for k-tuples for k = 1 to 5.
    It falls down at 6 dimensions - I once wrote a little
    program that sent the normal to that hyperplane in dimension
    6 to the x axis by a Householder reflection and plotted
    the x and y components of 6-tuples rotated by that transformation.

    A bunch of lines showed right up!

    It's no problem to fix this - just use another generator to
    shuffle the sequence returned by your LCG and you will avoid the
    hyperplane problem.

    - Jim
1394.14. "How about a commercial break for a minute?" by TIGEMS::ARNOLD (Some assembly required) Tue Mar 12 1991 22:07
    I hate to break up the fun in this rathole, but back to one of the
    original questions: is there a statistical method by which one could
    say, for example, that based on the history of events to date, event
    'A' is more likely to occur next than event 'B'?
    
    Or putting a new twist on the question, if you were willing to assume
    (for whatever reason) that events 'A', 'B', 'C', and 'D' were NOT going
    to occur as the result of the next experiment, does that lend any help
    in determining which of the remaining 'E' thru 'L' events should occur
    as the result of the next experiment?
    
    I promise that after I get an answer here, you may return undisturbed
    to this rathole...   :-)
    
    Thanks
    Jon
1394.15. by JARETH::EDP (Always mount a scratch monkey.) Wed Mar 13 1991 07:57
    Re .14:
    
    I think the text you were reporting about in .0 was trying to make a
    point like this:
    
    Suppose you have a set of trials and the results of each, and you know
    there are twelve possible results.  Now the thing is you do NOT know
    how likely each result is.  Yes, they are random, but you do not know
    that each result has an equal probability.  Your goal is to try to
    figure out something about the probabilities.
    
    Well, one way you can do this is to assume, for the moment, that each
    event has an equal probability.  Then you look at the set of results
    you have.  To "look at" them, you analyze them with various statistical
    tests to see if the results are, considered all together, something you
    are likely to get if each event has equal probability.  If you got
    equal numbers for each event or close to equal numbers for each event
    (and "close" is told to you by the statistical tests you apply), then
    you say "Okay, the results are close to what I would expect most of the
    time if each event had equal probability.".  But if the results aren't
    close to what would happen most of the time, then you say "Hey, the
    results aren't close.  Now, this could be a fluke, since these are
    random trials, but this is so unlikely I suspect that the probabilities
    for the different results aren't equal.".
    
    There, you've done statistics.
    
    You could also work with different assumptions.  Instead of assuming
    the probabilities are equal, you could assume they are some other
    specific values, and then you would run statistical tests to see if the
    results are within what would happen most of the time with those
    probabilities.
    
    One point with statistics is that it can never really tell you that A
    is more likely to occur than B.  It does tell you that A has been
    occurring a lot more than you would expect if it were equally likely. 
    Now that is not a deductive proof, but it is good enough for human
    beings.  At that point, we tend to say "All right, something's fishy
    here, I bet A is actually more likely than B.".  That's not because it
    is more likely just on the next trial, but because it has been more
    likely all along and now you have enough information to suspect that.
    
    
    				-- edp
1394.16. "Yup" by CADSYS::COOPER (Topher Cooper) Wed Mar 13 1991 11:26
RE: .13 (Jim)

>    2^32...

    Right you are, Jim, I had misremembered.  Increase the best possible
    number of hyperplanes for 3-tuples by a factor of 2^(1/3) and the maximum number of
    hyperplanes for 10-tuples by a factor of 2^(1/10) in my previous statements.


>    It's no problem to fix this - just use another generator to
>    shuffle the sequence returned by your LCG and you will avoid the
>    hyperplane problem.

    If you are going to use another generator, you might as well use a good
    one in the first place and not bother with the mixing.  Alternately,
    you can use the "self-shuffle" recommended in the second edition of
    Knuth vol 2 -- which is faster, doesn't need an "additional" generator,
    and seems to be just as good.
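
    In case the shuffle idea is unfamiliar, here is a rough sketch of that style
    of self-shuffle (Python, with random.random standing in for the underlying
    generator; consult Knuth for the exact algorithm):

        import random

        class ShuffledGenerator:
            """Keep a table of pending outputs; the previous output picks which
            table entry to hand out next, then that slot is refilled."""
            def __init__(self, source=random.random, table_size=97):
                self.source = source
                self.table = [source() for _ in range(table_size)]
                self.last = source()

            def next(self):
                j = int(self.last * len(self.table))   # index chosen by the previous output
                self.last = self.table[j]
                self.table[j] = self.source()          # refill the slot from the base generator
                return self.last

        g = ShuffledGenerator()
        print([round(g.next(), 3) for _ in range(5)])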

					Topher
1394.17. "Non-uniformity example" by CIVAGE::LYNN (Lynn Yarbrough @WNP DTN 427-5663) Wed Mar 13 1991 13:24
Note 95 discusses a problem centered around a pseudo random digit generator 
that appears to be non-uniform, for reasons which are not yet known.

Lynn 
1394.18. "one man's rathole is another man's answer" by CSSE::NEILSEN (Wally Neilsen-Steinhardt) Wed Mar 13 1991 15:00
.14>    I hate to break up the fun in this rathole, but back to one of the
>    original questions: is there a statistical method by which one could
>    say, for example, that based on the history of events to date, event
>    'A' is more likely to occur next than event 'B'?

Unfortunately, this 'rathole' is providing the best answer to your question.  .1,
.2, and .15 offer some other answers.

Perhaps a classification will make the situation clearer.  Assume you have a 
sequence of integers chosen from the set {1 ... 12 }, which shows no obvious
pattern.  If you allow me to use an imprecise word like 'random', then there
are four possibilities:

1. the sequence is uniform and random

2. the sequence is random but not uniform

3. the sequence is uniform but not random

4. the sequence is neither uniform nor random

Now lets look at the answer to your question for these four cases:

1.  In this case, analysis is a waste of time.  The probability of any integer
from the set coming up next is 1/12.  No amount of analysis and no statistical
technique will give you a better answer.

2.  In this case you can get enough sample data to estimate the probability 
that each integer will come up, called the probability distribution.  Once 
you have a good estimate of this distribution, you are done.  No amount of 
analysis and no statistical technique will give you a better answer.  There are
some cute techniques for putting bounds around your guesses for the next value,
but I'm not sure you are interested.

Note, while we are here, that the descriptive statistics of .2 apply to case 2 
above, and the inferential statistics of .2, .4 and .15 are a way of 
distinguishing between case 1 and case 2.

3. In this case you need to look for the deviations from randomness and 
exploit them.  There are basically two ways to do this: clear box (the good way)
and black box (the bad way).  The clear box method looks at the mechanism 
generating the sequence and searches for relationships among the elements.
The 'rathole' discussion here is just such a search for a particular mechanism.
If this is the mechanism of interest, then the 'rathole' provides the answer.

The black box method assumes that we cannot look directly at the mechanism, so
we analyze the sequence itself to see what clues it provides about deviations
from randomness.  A lot of stock-market and economic forecasting is done this 
way.  This is a well-studied but complex field.  If you are interested, check
the subject index of a large library under Statistics, Time Series Analysis.

4. This case introduces nothing new, it is just a combination of cases 2 and 3, so
techniques from both those cases apply.  In practice, case 4 is the one treated,
since it is much more common than case 3.

.14>    Or putting a new twist on the question, if you were willing to assume
>    (for whatever reason) that events 'A', 'B', 'C', and 'D' were NOT going
>    to occur as the result of the next experiment, does that lend any help
>    in determining which of the remaining 'E' thru 'L' events should occur
>    as the result of the next experiment?

All depends.  If you have no rational reason for that assumption, then it does 
not help.  If you have a reason, then it must be based on experience (black box)
or mechanism (clear box).  There are various statistical techniques for 
converting these rational reasons into probability distributions on E thru L
events.
1394.19"How to statistic without lies."CADSYS::COOPERTopher CooperWed Mar 13 1991 15:26172
RE: .14 (Jon)

>    I hate to break up the fun in this rathole, but back to one of the
>    original questions: is there a statistical method by which one could
>    say, for example, that based on the history of events to date, event
>    'A' is more likely to occur next than event 'B'?

    OK, OK, Jon.  I'll try to answer your question.  SHEESH, killjoy :-)

    There are two main branches of "inferential statistics": statistical
    test theory and descriptive statistics.  Statistical test theory, in turn
    can be broken down into two subfields.  The more general one is called
    "significance testing" and can be defined as the field which studies
    methods of answering simple yes/no questions with vague maybes.
    Traditionally however, partially because of the lack of computers and
    partially because of the simplicity of the answers, the other subfield
    of "hypothesis testing" is done.  Hypothesis testing can be defined as
    the field which studies methods of answering simple yes/no questions
    with inaccurate, misleading and frequently meaningless yes/no answers.

    The answers you've been getting so far have been answers in terms of
    statistical testing theory.  What you are probably interested in is
    answers in terms of descriptive statistics.  Descriptive statistics
    also has two main branches.  There is sample descriptive statistics,
    which is the art of getting precise and accurate descriptions of
    something that you have in front of you and that you therefore probably are not
    interested in describing.  The other branch, which is called
    statistical estimation theory, uses sample descriptive statistics to
    derive descriptions of something that you hope are correct and which
    you hope is what you are interested in.

    Anyway, on to your "experiment".  What you have is some number of
    trials, let's call the number N, each of which has 12 possible
    outcomes.  You also have 12 numbers Na, Nb, ..., Nl, representing
    the number of times each of the outcomes occurred.  Of course, N equals
    the sum of the N*.

    What those N* represent is a description of the sample you have of all
    possible outcomes, so these values represent your basic "sample
    descriptive statistics" in this case.  Other sample statistics can be
    derived from these.  For example, we can talk about the frequency
    of each outcome in the sample, fa, fb, ..., fl; which equals Na/N,
    Nb/N, ..., Nl/N.

    We are interested, however, not in the description of the sample for
    its own sake, but in what it tells us about the "population from which
    it was drawn", i.e., what it tells us about all experiments of that
    type.  (What we are really, really interested in, is what it says about
    the outcome of some future experiment.  Bayesian statistics tries to
    answer that question, while traditional statisticians say quite
    vehemently that it is meaningless to ask such questions.  They don't
    say it too loudly, however, since they realize that no one would bother
    to use their statistics if they couldn't at least *pretend* that it is
    answering such questions.)

    To answer questions about the population, however, we have to make some
    assumptions.  We don't have to assume that the outcomes are equally
    likely but we do have to make some assumptions about how the sample was
    drawn from the population.  The simplest useful assumptions -- about the
    only ones which might be justified given the information you supplied
    -- are that each trial is "independent" of the others (i.e., the outcome
    of one trial is not dependent on the outcome of any others), and that
    it is "time-invariant" (i.e., that the likelihood of a particular
    outcome occurring does not change from sample to sample).

    Now at this point you might ask the question "Is the data consistent
    with the hypothesis that all the outcomes are equally likely?"  You
    can then answer this using the techniques previously stated.  In
    significance testing you will get back a number between 0 and 1, called
    the significance, which tells you how consistent the data is with the
    hypothesis.  In hypothesis testing you set a threshold which, if
    exceeded allows you to say "it is inconsistent enough so we are not too
    likely to be wrong if we simply assume that the hypothesis is false."
    If it is less than the threshhold, then you can't technically reach any
    conclusion, but people generally will say under those circumstances
    that the hypothesis is at least a fairly accurate description of the
    situation (whether or not that is actually reasonable depends on
    something else called the "power" of the test).  Anyway, people have
    already explained to you how to do this.

    What you are really interested in, though (as I said, you are really,
    really interested in something else, but let that pass), is the
    following question: if I performed the experiment an *infinite* number
    of times, what proportion of the time would the outcome be A, what
    proportion would the outcome be B, etc.?

    You cannot really answer that question unless you actually perform the
    experiment an infinite number of times (which is generally not
    practical :-)).  But you can make an *estimate*, on the basis of the
    sample statistics which you have, of what answer you would get if
    you could do that.

    One thing you can do is use "point estimates".  A point estimate is a
    number which is your "best guess" as to what the value in question
    really is.  Exactly what "best" means can vary quite a bit depending on
    what you want, but it turns out, for this problem, almost any
    definition of "best" will give you the same value.  The best estimate
    for Fa, the frequency that outcome A occurs in the infinite population
    of experiments, is simply fa (in books that would be written as f-with-
    a-hat(caret)-subscripted-by-a; while Fa would be written the same
    without a hat).  So you can estimate (knowing that the estimate is
    almost certainly wrong, but that its as good an estimate as you can
    manage with the information you have) Fa with fa, Fb with fb, etc.
    You should be warned, however, that these estimates are interdependent:
    if one of them is considerably too high (because, just by chance, some
    extra trials happen to fall there) then the others will tend to be
    too low (since those extra trials "should" have gone to them).

    The problem with point estimates is that they don't communicate
    anything about how good the estimate is likely to be.  You get the same
    point estimate whether you have a lot of sample data (in which case the
    estimate is likely to be pretty accurate) or almost none (in which case
    the estimate is likely to be virtually meaningless).   The solution is
    to use "interval estimates".

    In interval estimation, instead of guessing a single value, you guess an
    interval within which you think the actual value probably falls.  If
    you have a large sample on which to make your estimate, then the
    interval will be small, while if you have a small sample the interval
    will be large.

    What you would like is to be able to name a pair of numbers so that you
    can be P% sure (for some specific value of P) that the actual value
    lies between the two numbers.  Once again, the traditional
    statisticians say that you can't do that (since the actual value either
    is or is not between the two numbers).  What the confidence interval
    (as its called) really means is obscure and confusing, so I won't
    bother explaining.  Go ahead and pretend it means what you want it to
    mean -- almost everyone else does too, including most of those who know
    "better."

    The actual formula for calculating the confidence interval in this case
    is pretty complicated.  There is a simple approximation (that's right,
    we are talking about approximating our estimation), which works as
    long as the sample is big enough, and the f*'s are not too small or
    too big.  It is:

        bound = f ± (z(P)*sqrt(f*(1-f)/n) + 1/2N)

    f is the sample frequency (fa, fb, etc.), N is the sample size, and
    z(P) is a value based on the standard normal distribution, which
    determines whether this is a 90% confidence interval, a 95% confidence
    interval or a 99% confidence interval:

	    z(.90) = 1.645
	    z(.95) = 1.960
	    z(.99) = 2.576

    Once again, these are interdependent.  If one is low the others are
    likely to be high.
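
    A small sketch of this bound in Python (illustrative only; the 40-out-of-100
    example is made up):

        from math import sqrt

        Z = {0.90: 1.645, 0.95: 1.960, 0.99: 2.576}

        def interval(count, n, p=0.95):
            # approximate confidence interval for the population frequency F,
            # given an outcome's sample count out of n trials
            f = count / n
            half_width = Z[p] * sqrt(f * (1 - f) / n) + 1 / (2 * n)
            return f - half_width, f + half_width

        print(interval(40, 100))        # roughly (0.30, 0.50)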

>    Or putting a new twist on the question, if you were willing to assume
>    (for whatever reason) that events 'A', 'B', 'C', and 'D' were NOT going
>    to occur as the result of the next experiment, does that lend any help
>    in determining which of the remaining 'E' thru 'L' events should occur
>    as the result of the next experiment?

    In the above discussion, replace N everywhere by

		    N(e-l) = Ne+Nf+Ng+Nh+Ni+Nj+Nk+Nl

    continue as before.  (This is known as conditional probabilities.  In
    this case it's pretty simple, but it can get complex and
    counter-intuitive, see the notes on the Monty Hall paradox and on
    "illnumeracy").

>    I promise that after I get an answer here, you may return undisturbed
>    to this rathole...   :-)

    It may be a rathole to you, but to us rats it's home.

				    Topher
1394.20. "I think we're getting somewhere!" by SMEGIT::ARNOLD (Some assembly required) Mon Mar 18 1991 20:24
    Topher (.19),

    Thanks for the very lengthy & informative response.  I think the last
    part of your note is what I was hoping to find out when I originally
    wrote the base note.  However, some questions:

>       bound = f ± (z(P)*sqrt(f*(1-f)/n) + 1/2N)
>
>    f is the sample frequency (fa, fb, etc.), N is the sample size, and
>    z(P) is a value based on the standard normal distribution, which
>    determines whether this is a 90% confidence interval, a 95% confidence
>    interval or a 99% confidence interval:
>
>	    z(.90) = 1.645
>	    z(.95) = 1.960
>	    z(.99) = 2.576

     * I understand when you say that "N" is the sample size, but is that
     the same as "n" (lower case vs upper case), or is "n" supposed to
     represent something else?

     * Presumably I will get two values for "bound" for each variable
     "A" thru "L". (One for "f plus etc" and one for "f minus etc".)
     What are those values telling me?  For example (and I haven't done
     the math yet, just pulling numbers out of the air), say I get
     values of 1.23 and 2.34 for "A" and 1.88 and 2.99 for "B" after
     doing the math shown above.  What did those values tell me about
     the statistical likelihood of "A" occurring as the result of the
     next experiment over "B".  (If anything)?

     Thanks again
     Jon
1394.21. "Clarification." by CADSYS::COOPER (Topher Cooper) Tue Mar 19 1991 11:25
RE: .20 (Jon)

>     * I understand when you say that "N" is the sample size, but is that
>     the same as "n" (lower case vs upper case), or is "n" supposed to
>     represent something else?

    They're the same.  It's a typo.  You don't think I actually took the
    time to proofread all that, do you? :-)

>     For example (and I haven't done the math yet, just pulling numbers out
>     of the air), say I get values of 1.23 and 2.34 for "A" and 1.88 and
>     2.99 for "B" after doing the math shown above.  What did those values
>     tell me about the statistical likelihood of "A" occurring as the result
>     of the next experiment over "B".  (If anything)?

    THESE particular values wouldn't tell you anything about the
    statistical likelihood of anything.  They would simply tell you that
    either I gave you the wrong formula, or you did your math wrong. :-)

    All bounds should come out between 0 and 1.  The assumptions made here
    mean that we can meaningfully talk about there being some real, specific
    probability that, for example, a particular trial will have outcome A.
    We don't know what that probability is, but, we can, on the basis of
    past behavior, estimate what that probability is.  The simplest thing
    we can do is make a point estimate and just say that "the probability
    of any particular trial having outcome A is probably approximately fa".
    This is frequently done, but the problem is that that statement doesn't
    tell us how probably or how approximately.

    The alternative is to use a confidence interval.  If you use the
    formula I gave with P=.95 and you had 100 trials, of which 40 came out
    with outcome A, then your 95% confidence interval for Fa (the unknown
    actual probability for outcome A) is roughly (.3, .5).  Ignoring the
    denials of the traditionalists that such a statement actually means
    anything at all, this means that you can be 95% sure that Fa is
    actually between .3 and .5.

    The same applies to the other outcomes.  You have to be a little
    careful, however, in interpreting all of the numbers at once.  There
    is some degree of interaction between them.  For example, not all of
    the outcomes' "actual" values can be near their respective lower bounds
    or the total probability of any of the outcomes occurring would be less
    than 1.  Also, with 12 95% confidence intervals, there is close to a
    50% chance that at least one of the actual probabilities will be
    outside its calculated confidence interval.

    Nevertheless, these 12 confidence intervals will give you a good
    feeling for what the relative likelihoods of the various outcomes
    are.  I suggest "plotting" the results.  The x axis should range
    between 0 and 1 (although you might change that upper limit if your
    largest upper limit is relatively small).  The y axis should be your
    twelve outcomes (make two plots, one in which they are arranged A, B,
    C, D etc; and one in which the one with the smallest point estimate is
    at the bottom, the next largest above it, and so on).  For each outcome
    draw a horizontal bar from the lower bound to the upper bound with a
    dot at the center point (the point estimate).
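
    If a plotting package is available, something like this rough matplotlib
    sketch produces that picture (the bounds listed are placeholders -- use the
    twelve intervals you actually computed):

        import matplotlib.pyplot as plt

        outcomes = list("ABCDEFGHIJKL")
        lowers = [0.05, 0.04, 0.07, 0.06, 0.05, 0.08, 0.04, 0.06, 0.05, 0.07, 0.06, 0.05]
        uppers = [0.12, 0.10, 0.15, 0.13, 0.11, 0.16, 0.10, 0.13, 0.12, 0.14, 0.13, 0.11]
        mids = [(l + u) / 2 for l, u in zip(lowers, uppers)]

        y = range(len(outcomes))
        plt.errorbar(mids, y,
                     xerr=[[m - l for m, l in zip(mids, lowers)],
                           [u - m for u, m in zip(uppers, mids)]],
                     fmt='o')                     # horizontal bar with a dot at the point estimate
        plt.yticks(y, outcomes)
        plt.xlim(0, 1)
        plt.xlabel('estimated probability of the outcome')
        plt.show()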

					Topher
1394.22. "Should I be seeing something here?" by SMEGIT::ARNOLD (Some assembly required) Mon Mar 25 1991 22:18
    Topher,

    Thanks much for your clarifications.  Now I've setup a couple of Q&D
    (quick & dirty) programs to do the math based on my test file, going
    back 100, 200, 300, 400, & 500 experiments, seeing what values I get
    each time for each "A" thru "L" variable.  (I picked those numbers
    arbitrarily; figured less than 100 experiments would not be enough to
    base a good guess on, and more than 500 might tend to be too many?)

    I must have done the math right, because I almost always get lower and
    upper bound values between 0 and 1.  (Although I occasionally get a
    negative value?)  These values didn't appear to tell me much (maybe I'm
    still not sure what to look for?), so I took your suggestion and
    graphed the values, connecting the midpoints (midpoint = halfway
    between upper & lower bounds) of Fa thru Fl.  Nice graph, but is it
    supposed to tell me anything significant?  Just wondering if I'm
    looking for anything in particular; ie:

                            Z(.90)            Z(.95)            Z(.99)
                        lower   upper     lower   upper     lower   upper
    Fa                 0.0923  0.1797    0.0855  0.1865    0.0722  0.1998
    Fb                 0.0539  0.1301    0.0482  0.1358    0.0369  0.1471
    Fc                 0.0404  0.1116    0.0352  0.1168    0.0248  0.1272
       etc.

    If this is telling me something that I'm missing, please let me
    know what it is.

    Thanks
    Jon
1394.23. "Only what you should expect to see." by CADSYS::COOPER (Topher Cooper) Tue Apr 09 1991 18:33
RE: .22 (Jon)

    > ... and more than 500 might tend to be too many?)

    I wouldn't worry about "too many" too much.  What is too many depends
    on context.  The more samples you take the more sensitive your tests
    become.  If you take "too many" samples your tests become majorly
    influenced by effects that are so small that you want to ignore them. 
    Your computer simulation, however, is a pretty direct model of the
    process you are interested in.  The only extraneous factors (e.g.,
    slight deviations from ideal behavior of your pseudorandom number
    generator) are so small as to require a great many samples before they
    would interfere.  And there is less of an issue with descriptive
    statistics than with hypothesis testing in any case.

    You could unquestionably increase that to 50,000 samples (which would
    give you roughly one more decimal place of accuracy) and could probably
    go to 5,000,000 samples (roughly two more decimal places of accuracy).

    > (Although I occasionally get a negative value?)

    You probably get the negative values with the P=.99, N=100 case.  The
    formula I gave is an approximation.  The usual rule of thumb given for
    this (and related) approximations is that the Nx (the number of samples
    with outcome "x") should be at least 5 for it to be "reasonably
    accurate".  The approximation, however, is weaker in the "tails", and
    .99 is far enough out in the tail to really require more data.  Just
    don't take the P=.99/N=100 case too seriously.

    > These values don't tell me much.

    That is as it should be, since what they are intended to tell you is
    something that you already know: that the probability on each trial for
    the outcome to be A is approximately 1/12 (= .0833...), to be B is
    1/12, to be C is 1/12, etc.  You know that because you set up your
    program for that to be the case.  In "real life" you wouldn't know
    that, and the point is to determine if one outcome is more likely than
    another and by how much.

    Here are some observations you can make, based, in part, on your
    knowledge that the *real* underlying probability is 1/12.

	1) As the number of samples gets larger the confidence intervals
	   get narrower.

	2) As P increases from .9 to .95 to .99 the confidence intervals get
	   wider.

	3) The .9 confidence intervals contain the real probability value
           about 90% of the time.  The .95 confidence intervals contain the
           real probability value about 95% of the time and the .99
           confidence intervals contain the real probability value 99% of
           the time. (These are not independent observations, though, when
           the 1 time out of 100 event of the 99% confidence interval being
           "missed" occurs the .95 and .9 confidence intervals will
           obviously also be missed).

    Here are some experiments to try which will make the data "more
    interesting".

    EXPERIMENT 1:  Instead of generating a whole number between 1 and 12
    generate a number between 1 and 13 instead.  If the number is 13 change
    it to 1.  What this means is that outcomes 2 through 12 are still
    equally probable, but outcome 1 is twice as likely as any of them. Note
    the differences.

    EXPERIMENT 2:  Set up an array, P, indexed from 1 to 12.  Then do the
    following (in pseudocode -- sorry my BASIC is too rusty to be useful):

	Step 1:  FOR i FROM 1 TO 12 DO P[i] := urand(0.0, 1.0);
	    Where the call on "urand" returns a uniform random real value
	    between 0 and 1.

	Step 2: (Optional) sort_descending(P)
            This step will make the simulation run faster.  For this
            purpose, however, it's probably not worth the trouble.  When
            this is finished, the largest of the 12 values generated in
            step 1 will be in P[1]; the next largest in P[2] and so on down
            to the smallest in P[12].

	Step 3: accum := 0;
		FOR i FROM 1 TO 12 DO
		    accum := accum + P[i];
		    P[i] := accum;
            Upon completion of this step P[1] will contain the value it had
            at the end of the previous step.  P[2] will have the sum of its
            old value and P[1]'s old value.  P[3] will have the sum of its
            old value and of P[2]'s and P[1]'s old value.  Etc. up to P[12]
            which will contain the total of all of the original values. 
            accum will also contain the total of all of the original
            values.

	Step 4: FOR i FROM 1 TO 12 DO P[i] := P[i]/accum;
            Upon completion of this step all the values will have been
            "normalized", with P[1] through P[11] less than 1 and P[12]
            equal to 1.  They will be in increasing order.

    What array P now contains is a representation for a "random"
    probability for each of the 12 outcomes.  Specifically, P[1] contains
    the probability that an outcome numbered 1 or less (i.e., outcome 1)
    will occur on any particular trial; P[2] contains the probability that
    an outcome numbered 2 or less (i.e., either outcome 1 or outcome 2)
    will occur on any particular trial; P[3] contains the probability that
    an outcome numbered 3 or less (i.e., outcome 1, outcome 2, or outcome
    3) will occur on any particular trial; up to P[12] containing the
    probability that an outcome numbered 12 or less (i.e., any of the
    outcomes) will occur on any particular trial (i.e., 1 since one of the
    12 outcomes must occur).  To use this array you do the following when
    you want to generate a random outcome.

	    X := urand(0.0, 1.0);
	    FOR i FROM 1 TO 12 DO
                IF (X less-than-or-equal-to P[i])
		THEN
		    outcome := i;
		    LEAVE LOOP;

    The variable "outcome" will contain the number between 1 and 12
    representing the randomly chosen outcome.  Just conduct your simulated
    experiment as before but use this outcome rather than the all-equally-
    likely outcomes you used originally.  Now you *don't* know in advance
    what the values are meant to tell you and the simulation is more
    "realistic" in that regard.  (I wrote the above for clarity, there is
    obviously a lot of optimization that can be done in the initialization
    steps, but they are unlikely to make a noticeable difference in the
    overall simulation).
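
    Here is the same experiment as compact Python, in case running code is easier
    to follow than the pseudocode (random.random stands in for urand, and the
    optional sorting step is skipped):

        import random
        from collections import Counter

        # Steps 1, 3 and 4: a hidden, non-uniform distribution over the 12 outcomes
        raw = [random.random() for _ in range(12)]
        total = sum(raw)
        P, accum = [], 0.0
        for v in raw:
            accum += v
            P.append(accum / total)          # normalized running totals; P[11] == 1.0

        def trial():
            x = random.random()
            for i, bound in enumerate(P, start=1):
                if x <= bound:               # first cumulative bound at or above x
                    return i

        print(Counter(trial() for _ in range(5000)))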

                                Topher
1394.24. "those pesky negative values" by CSSE::NEILSEN (Wally Neilsen-Steinhardt) Wed Apr 10 1991 14:19
.23>    > (Although I occasionally get a negative value?)
>
>    You probably get the negative values with the P=.99, N=100 case.  The
>    formula I gave is an approximation.  The usual rule of thumb given for
>    this (and related) approximations is that the Nx (the number of samples
>    with outcome "x") should be at least 5 for it to be "reasonably
>    accurate".  The approximation, however, is weaker in the "tails", and
>    .99 is far enough out in the tail to really require more data.  Just
>    don't take the P=.99/N=100 case too seriously.

There is another way to look at the negative values, which may clarify why they
appear and what to do about them.

By definition, all the sample Fx's must be bounded by zero and one.  The 
Px's of the underlying distribution, by definition, are similarly bounded by
zero and one.  But the error bounds in .19 are based on an approximate
distribution function, the Z or normal distribution, which is not so bounded.

The result is that when Fx happens to be close to zero, and z(P) is too large,
the value computed for the lower confidence limit can become negative.

One solution is to quietly replace negative numbers with zero.

Another is to use a properly bounded distribution, in this case the beta 
distribution.  This means more computation, but MIPS are cheap these days.
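
One concrete beta-based construction is the classical Clopper-Pearson interval, whose
bounds come from beta quantiles and so can never leave [0, 1].  A sketch, assuming SciPy
is available (this is one such construction; it may or may not be the exact variant
intended here):

    from scipy.stats import beta

    def exact_interval(count, n, p=0.95):
        # Clopper-Pearson interval for a binomial proportion, via beta quantiles
        alpha = 1 - p
        lower = 0.0 if count == 0 else beta.ppf(alpha / 2, count, n - count + 1)
        upper = 1.0 if count == n else beta.ppf(1 - alpha / 2, count + 1, n - count)
        return lower, upper

    print(exact_interval(3, 100))      # the lower bound stays at or above zero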