T.R | Title | User | Personal Name | Date | Lines |
--------------------------------------------------------------------------------
1803.1 | A superficial explanation | AMCCXN::BERGH | Peter Bergh, (719) 592-5036, DTN 592-5036 | Wed Oct 06 1993 14:39 | 24 |
| <<< Note 1803.0 by SNOFS1::ZANOTTO >>>
-< Maximum Likelihood ! >-
< Can anyone offer an explanation on maximum likelihood with regards to
< statistics ?
The maximum-likelihood technique was invented, if memory serves*, in the 1930s
by R A Fisher. It is based on the common-sense notion that the observed
outcome of an experiment probably is the most likely outcome. Thus, you have
an experiment whose result will vary with a number of parameters whose values
are unknown. You express the probability of the outcome of the experiment as a
mathematical function of the experimental values and the parameters, and then
maximize that probability over the parameters using standard techniques from
calculus.
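For instance, here is a rough numerical sketch of that recipe (my own made-up
data, in Python with SciPy; nothing specific to Fisher): write the negative
log of the probability as a function of the unknown parameter and let a
one-dimensional optimizer find the value that maximizes the probability.

    # Rough sketch of the recipe above, with made-up data: write the
    # (negative log-)probability of the observations as a function of the
    # unknown parameter, then maximize the probability (i.e. minimize its
    # negative log) over that parameter.
    import numpy as np
    from scipy.optimize import minimize_scalar

    data = np.array([4.2, 5.1, 3.8, 4.9, 5.4])      # hypothetical observations

    def neg_log_likelihood(mu):
        # Normal model with standard deviation fixed at 1; only the mean is unknown.
        return 0.5 * np.sum((data - mu) ** 2)

    result = minimize_scalar(neg_log_likelihood)
    print(result.x, data.mean())    # the ML estimate coincides with the sample mean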
< Apparently there is a theorem to explain maximum
< likelihood.
I'm not sure what kind of theorem you are talking about that purports to
"explain" maximum likelihood.
==========================
* It's 25 years since I last taught statistics and my memory is not
perfect...
|
1803.2 | | STAR::ABBASI | white 1.e4 !! | Wed Oct 06 1993 17:06 | 5 |
| Is talking about the maximum-likelihood event the same as talking about
the event with the highest probability (with respect to the output
from the same experiment)?
\nasser
|
1803.3 | Slipping Bayes in by the Back Door. | CADSYS::COOPER | Topher Cooper | Wed Oct 06 1993 17:32 | 67 |
| Fisher was basically trying to get the advantages of a Bayesian view
without abandoning a strict frequentist viewpoint (he explicitly
references Bayes in his paper introducing the concept).
Imagine that you are using a sample to estimate a particular quantity,
R. What Bayes talked about -- and what seems natural to many people --
is the "probability" that R equals some particular value r. From
a frequentist viewpoint this is meaningless -- R either is or is not
equal to r, there is no probability involved. There could only be
a probability if one had a sequence of experiments in which something
like the sample were generated, but there are many ways of constructing
such sequences (which are handled, in modern Bayesian statistics, by
the prior parameter distributions).
But you can define something which Fisher considered completely
different from a probability, which he called a likelihood. That
was a quantity proportional to the probability that the given
sample would be generated if you assume that R=r. (Note that this is
the opposite of the quantity of interest. You are interested in

    p(R=r | sample-statistics),

but the likelihood is

    C*p(sample-statistics | R=r).)
Fisher left the constant of proportionality explicitly undefined -- if
you tried to set it to some value you were making likelihoods too
explicit.
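To make that reversal concrete, here is a toy sketch (mine, in Python, with
made-up numbers): for a handful of candidate values r, the likelihood
C*p(sample | R=r) and the posterior p(R=r | sample) are different functions
of r, and the first is only defined up to the overall constant.

    # Toy illustration: the likelihood C*p(sample | R=r) and the posterior
    # p(R=r | sample) are different functions of r.  All numbers are made up.
    import numpy as np
    from scipy.stats import binom

    candidates = np.array([0.2, 0.5, 0.8])      # candidate values r for the quantity R
    prior      = np.array([0.6, 0.3, 0.1])      # a hypothetical prior over those values

    # The sample: 7 "successes" out of 10 trials.
    likelihood = binom.pmf(7, 10, candidates)   # p(sample | R=r), up to the arbitrary C

    posterior = prior * likelihood
    posterior /= posterior.sum()                # p(R=r | sample) via Bayes' rule

    print(likelihood)    # largest at r = 0.8
    print(posterior)     # largest at r = 0.5 -- the prior shifts the answer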
Although the lack of a meaningful constant of proportionality leaves
likelihoods rather ghostly, they are still very useful. In particular,
if you are trying to compare various possible ways of estimating R,
you can choose the one which maximizes the likelihood, since the maximum
will be the maximum whatever scale factor is applied. The principle
of maximum likelihood says basically that the "best" estimator is the
maximum likelihood estimator -- if it exists. Generally maximum
likelihood is applied to derive a general analytic method -- most
of the familiar estimators (such as sample mean for estimating
population mean) meet the maximum likelihood criterion. Maximum
likelihood is frequently invoked explicitly when attempts are made to
fit a complex model to a set of data -- multiple parameters are chosen
to meet the maximum likelihood criterion.
Likelihoods sometimes make a semi-explicit appearance in classical
statistics in the form of likelihood ratios. Since the proportionality
constants cancel out, they are irrelevant, so one can calculate the
ratio of two likelihoods even though you cannot attach a single number
to either one.
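A small sketch of that cancellation (mine, with made-up numbers; in the
binomial case the binomial coefficient plays the role of the constant factor
that does not depend on the parameter):

    # The binomial coefficient is common to both likelihoods, so it cancels
    # in the ratio; the ratio is well defined even though neither likelihood
    # has a meaningful absolute scale.  Numbers are hypothetical.
    from scipy.stats import binom

    n, x = 20, 14           # made-up data: 14 successes in 20 trials
    p1, p2 = 0.5, 0.7       # two hypothesized parameter values

    ratio_full   = binom.pmf(x, n, p1) / binom.pmf(x, n, p2)
    ratio_kernel = (p1**x * (1 - p1)**(n - x)) / (p2**x * (1 - p2)**(n - x))

    print(ratio_full, ratio_kernel)    # equal, up to floating-point rounding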
Bayesians, of course, choose a constant of proportionality such that
the sum of the likelihoods of all the distinct alternatives is equal to
one. They then treat the result exactly like a probability. In
deference to frequentists, however, who tend to get upset if anyone
talks about the probability of unique events, Bayesians frequently
refer to these as likelihoods, even though they are manipulated
identically to probabilities.
Maximum likelihood provides a fundamental principle in modern
statistics. It is used to justify much of what is done. There are as
a result hundreds of theorems about it, dealing with its consistency as
a criterion, existence in various cases, asymptotic behaviors,
relationship to other criteria, optimality in various circumstances,
etc. I have no idea what would qualify as "the" theorem about maximum
likelihood.
Topher
|
1803.4 | Maximum likelihood ! | SNOFS1::ZANOTTO | | Mon Oct 11 1993 03:37 | 27 |
| Hello Topher
RE Maximum Likelihood
Thanks for the explanation in the math notes conference. Being a novice to
all this, most of the terminology used in your explanation went over my
head. It sounds like I might need to do some light/heavy reading into
the subject.
Two questions, if I may: is it possible to go through a brief example to
illustrate how the equation works ? As for reading literature, which
books would you recommend, noting that I am a layman at this ?
The reason for all this is that I am working on a problem in Australia at
the moment which involves neural networks. Recently I have discovered
that another guy, from Sydney University, has been working on a similar
problem. The difference is that he has solved the problem and I have not.
The other day I called him for some helpful advice. This is what he told
me. The secret is to change the way that the neural network alters its
internal weighting by using the principles of maximum likelihood. So here I
am.
Looking forward to your reply, Topher.
Regards
Frank Zanotto
|
1803.5 | An article. | CADSYS::COOPER | Topher Cooper | Mon Oct 11 1993 17:36 | 17 |
| I'm afraid I haven't read enough systematically in this area to
recommend anything in particular about Maximum Likelihood. Almost
anything on statistical models or statistical estimation will do --
you should browse at your local library.
I thought I remembered seeing something on the subject of ML and Neural
Networks, and I was correct. Check out:
Maximum Likelihood Neural Networks for Sensor Fusion and Adaptive
Classification, by Leonid I. Perlovsky and Margaret M. McManus;
Neural Networks, Vol. 4, No. 1, 1991, pp. 89-102
I haven't read it -- only skimmed it -- but you might be able to apply
the method from information in that article without needing to fully
understand it (though understanding is always better).
Topher
|
1803.6 | Maximum Likelihood ! | SNOFS1::ZANOTTO | | Mon Oct 11 1993 20:57 | 8 |
| Hi Topher
Many thanks for that information. I'll start looking around the place
and hopefully I'll find what I am looking for.
Regards
Frank Zanotto
|
1803.7 | An example. | CADSYS::COOPER | Topher Cooper | Tue Oct 12 1993 16:00 | 84 |
| Here is a simple example of the use of the Maximum Likelihood
Principle:
Let's suppose that you have N Bernoulli trials (this means that you have
done something N times; that there are two possible outcomes each time;
and that the probability, p, that the first outcome will occur is the
same for all the trials, independent of what occurred in the previous
trials, how much time has elapsed etc.). The first outcome occurred
X times. You want an estimate of what the value of "p" was which
caused this to occur.
What you would ideally like to choose is the value of p which maximizes
prob(p | X)
That is, given the evidence that X provides, you want the most likely
value for p. Unfortunately, according to the traditional "frequentist"
interpretation of probability, this is not a determinable value.
Instead you can seek to maximize Lx(p), which is the likelihood function
for p given the specific value of X:
Lx(p) = C*prob(X | p)
for some unknowable constant C.
The prob(X | p), the probability that there would be "X" occurrences out of
"N" of the first outcome given that the probability for each one is
"p", is determined by the Binomial distribution.
    C(N,X) * p^X * (1-p)^(N-X)

where C(N,X) denotes the binomial coefficient "N choose X".
The constant factor will not affect where the maximum occurs, so we can
just look for the p which maximizes prob(X | p). As is frequently the
case, it turns out to be easier to find the maximum of the log of this
formula. To find the maximum we solve for the derivative with respect
to "p":
    d log(prob(X | p)) / dp

      = d log( C(N,X) * p^X * (1-p)^(N-X) ) / dp

      = d [ log C(N,X) + log( p^X * (1-p)^(N-X) ) ] / dp

      = d log C(N,X) / dp  +  d log( p^X * (1-p)^(N-X) ) / dp
The first term is the derivative of a constant (no dependence on p) so
we want to solve:
    d log( p^X * (1-p)^(N-X) ) / dp = 0

    X/p - (N-X)/(1-p) = 0
which gives us as the p value at which the likelihood function is a
maximum:
    p-hat = X/N
Which is the unsurprising result: if, for example, you observe
something happening 5 times out of 10, your "best" guess (according
to the Maximum Likelihood principle) is that it occurs half the time.
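If you would rather check that numerically than trust the calculus, here is
a quick sketch (my addition, in Python with SciPy):

    # Numerical check of the derivation above: for X = 5 successes in N = 10
    # Bernoulli trials, the binomial likelihood prob(X | p) peaks at p = X/N = 0.5.
    import numpy as np
    from scipy.stats import binom

    N, X = 10, 5
    p_grid = np.linspace(0.001, 0.999, 999)     # grid of candidate p values
    likelihood = binom.pmf(X, N, p_grid)

    print(p_grid[np.argmax(likelihood)])        # prints 0.5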
Topher
|
1803.8 | maximum likelihood and decision theory | BIGVAX::NEILSEN | Wally Neilsen-Steinhardt | Wed Oct 13 1993 13:26 | 46 |
| Frank,
I am not going to correct anything Topher has said, but I'll add another
viewpoint which you may find interesting if you want to one-up your friend
in Sydney.
An alternative to the frequentist interpretation used by Fisher is the
subjective or Bayesian interpretation of probability. In this interpretation
it is natural to speak of the probability of some value of the parameter p,
given some data X (using the symbols of Topher's .7). It is also natural to
speak of the probability distribution as a function of p, given X. And it
is natural to focus on the maximum of this distribution, so the Principle
of Maximum Likelihood seems to just fall out.
But you can get a lot more than this if you look closely. The Principle of
Maximum Likelihood actually depends on some implicit assumptions about what
you are going to do with the value of p that you estimate, and the costs
associated with estimating it incorrectly. When we make these assumptions
explicit, it often turns out that there is a better (more cost effective)
estimate of the parameter p.
The study of these assumptions and estimates is called decision theory or
statistical decision theory. In principle, it would allow your neural net
to make better decisions. A book I often use, which covers a range of
statistical methods, is _Statistics - Probability, Inference and Decision_ by
R. L. Winkler and W. L. Hays, Holt, Rinehart and Winston, 1975.
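To make the idea concrete, here is a toy sketch (my own numbers and a
deliberately crude asymmetric loss, not taken from Winkler and Hays): when
overestimating p is costlier than underestimating it, the cost-minimizing
estimate comes out below the maximum-likelihood value X/N.

    # Toy example: with an asymmetric loss, the estimate that minimizes the
    # expected cost differs from the maximum-likelihood estimate X/N.
    # Posterior taken as Beta(X+1, N-X+1), i.e. a uniform prior on p.
    import numpy as np
    from scipy.stats import beta

    N, X = 10, 7
    p_ml = X / N                                     # maximum-likelihood estimate

    p_grid = np.linspace(0.001, 0.999, 999)
    posterior = beta.pdf(p_grid, X + 1, N - X + 1)
    posterior /= posterior.sum()                     # normalize over the grid

    def expected_loss(estimate):
        # Overestimating p is (arbitrarily) five times as costly as underestimating.
        err = estimate - p_grid
        return np.sum(np.where(err > 0, 5.0 * err, -err) * posterior)

    losses = [expected_loss(e) for e in p_grid]
    print(p_ml, p_grid[np.argmin(losses)])           # the second number is smaller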
In practice, there may be some limitations on actually using decision theory
inside a neural net.
1. You may decide that the additional math is more than you care to deal with.
Maximum likelihood is usually simpler, but not by a lot.
2. Your neural net may not have enough CPU muscle or real time to do the
calculations. In general both maximum likelihood and decision theory require
a lot of computation to carry through. In many special cases, either or both
may simplify down to a bit of simple arithmetic.
3. The calculations for either maximum likelihood or decision theory may have
instabilities or other computationally undesirable properties. At least if
you have some alternatives, you have a better chance of avoiding the
instabilities.
4. The actual problem you are working on may be such that there is no
particular benefit to using decision theory.
|
1803.9 | Yup. | CADSYS::COOPER | Topher Cooper | Wed Oct 13 1993 18:00 | 38 |
| A neural-net is a hardware or software embodiment of a class of
statistical procedures -- most commonly statistical classification or
clustering procedures. Many neural-net people get upset when you say this,
because it implies -- accurately -- that what they are dealing with is
just another set of statistical procedures, though perhaps particularly
interesting ones.
Looked at that way, we can view the process of training a neural-net
as follows. We can imagine that there is a neural-net of the
configuration we are looking at (i.e., a set of weights) which
maximally classifies all possible inputs. We want to estimate that
set of weights on the basis of a limited sample.
This is normally done by some kind of iterative procedure which takes
each sample input, computes the current network's output, and grades
the output in terms of the known "proper" behavior for that input. The
grading is where the relative costs of different kinds of errors can
be -- and very frequently are -- factored in. This grade is then used
to modify all the weights to try to reduce the error and the process is
repeated: with the same sample, with a previously processed sample, or
with a new sample, depending on the specific training method.
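A bare-bones sketch of that loop for a single logistic unit (my own toy data;
gradient ascent on the log-likelihood, i.e. the maximum-likelihood criterion,
with no cost weighting):

    # One "neuron" (logistic unit) trained by repeatedly grading its output
    # against the known labels and nudging the weights uphill on the
    # log-likelihood of those labels.  Data are made up.
    import numpy as np

    rng = np.random.default_rng(0)
    inputs = rng.normal(size=(100, 2))                        # hypothetical training inputs
    labels = (inputs[:, 0] + inputs[:, 1] > 0).astype(float)  # known "proper" outputs

    weights = np.zeros(2)
    rate = 0.1
    for _ in range(200):
        outputs = 1.0 / (1.0 + np.exp(-inputs @ weights))     # current network's outputs
        gradient = inputs.T @ (labels - outputs)              # gradient of the log-likelihood
        weights += rate * gradient / len(labels)              # adjust weights, repeat

    print(weights)    # both weights come out positive, matching how the labels were built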
It has been shown that, under quite general assumptions and given
indefinite computational resources, decision theory based on
Bayesian statistics makes optimal use of information. This means that
a tractable Bayesian computation (or a good, tractable approximation
to one) would be ideal.
In fact, the people who wrote the article I spoke of seem well aware of
this and, as I remember, spoke of Maximum Likelihood and Bayesian as
equivalent (which they are under the assumption of uniform Bayesian
prior). Not being statisticians they didn't have to take sides and
decide which of the two exactly equivalent things they were doing.
So -- Maximum Likelihood trained nets are (or at least purport to be)
(Bayesian) decision theory based.
Topher
|
1803.10 | yup, again, almost | ICARUS::NEILSEN | Wally Neilsen-Steinhardt | Thu Oct 14 1993 12:41 | 15 |
| .9> In fact, the people who wrote the article I spoke of seem well aware of
> this and, as I remember, spoke of Maximum Likelihood and Bayesian as
> equivalent (which they are under the assumption of uniform Bayesian
> prior). Not being statisticians they didn't have to take sides and
> decide which of the two exactly equivalent things they were doing.
Actually, it should take a few more assumptions to make them equivalent. For
example, that the posterior distribution is unimodal and roughly symmetric
(pretty likely in the real world) and the loss function is symmetric and
well behaved (also likely).
If non-statisticians casually mention decision theory, then I'd guess it was
pretty well known in this field, and somebody once went to the trouble of
showing that Maximum Likelihood is a sufficiently good approximation to
decision theory.
|
1803.11 | I can out nit-pick you, I bet. | CADSYS::COOPER | Topher Cooper | Thu Oct 14 1993 16:12 | 34 |
| They didn't actually mention decision theory, to the best of my memory;
what they mentioned was some phrase like "according to the Bayesian
criteria". It is standard practice, however, to include relative costs
of different kinds of errors in the evaluation function. Those two
together make Bayesian Decision Theory.
>Actually, it should take a few more assumptions to make them equivalent. For
>example, that the posterior distribution is unimodal and roughly symmetric
>(pretty likely in the real world) and the loss function is symmetric and
>well behaved (also likely).
There are a number of criteria used in Bayesian point estimation, but
the most common is the mode (if it exists) of the posterior
distribution. Not even approximate symmetry in the posterior
distribution is necessary (though gross skew might call into question
the appropriateness of the criterion in both cases). In both Bayesian
point estimation and Maximum Likelihood the procedure fails or is
ambiguous if there isn't a clear maximum.
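(A toy check of that mode criterion, and of the uniform-prior equivalence
mentioned in .9 -- my own sketch, with made-up numbers:)

    # With a uniform prior, the posterior is proportional to the likelihood,
    # so the posterior mode and the maximum-likelihood estimate coincide.
    import numpy as np
    from scipy.stats import binom

    N, X = 10, 7
    p_grid = np.linspace(0.001, 0.999, 999)

    likelihood = binom.pmf(X, N, p_grid)        # prob(data | p)
    posterior = likelihood / likelihood.sum()   # uniform prior, then normalize

    print(p_grid[np.argmax(likelihood)], p_grid[np.argmax(posterior)])   # both 0.7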
Your points about the cost function needing to be well behaved apply
equally, I think, whether the costs are applied to a true ML front-end
or a true Bayesian front-end.
>pretty well known in this field, and somebody once went to the trouble of
>showing that Maximum Likelihood is a sufficiently good approximation to
>decision theory.
I don't know that anyone has bothered to show this. ML neural-nets
are not the mainstream. Cost functions (whose presence in the
evaluation functions used in training makes the neural net a
decision-theoretic procedure whether or not it is properly done) are, however,
mainstream in NN work.
Topher
|
1803.12 | Maximum likelihood ! | SNOFS1::ZANOTTO | | Sun Oct 17 1993 21:09 | 12 |
| Hello all
Thanks for the input. Definitely a lot to digest! One question, if I
may: I have been told that a neural network can solve any
problem, that is, any problem that a maximum likelihood
algorithm / theorem can solve. That is, a standard backpropagation
neural network without any modifications is all I need. Am I right in
saying / believing this?
Regards
Frank Zanotto
|