The particular maximization technique used on this data was the "Simplex" algorithm; it works directly upon the likelihood function to find its maximum and the associated parameter values. There are currently available a number of other good algorithms.
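The Simplex (Nelder-Mead) maximization described above can be sketched as follows, assuming SciPy is available. The two-component model, the synthetic data, and all starting values here are illustrative, not the paper's actual data:

```python
# Sketch: maximizing a mixture likelihood with the Nelder-Mead ("Simplex")
# algorithm, which works directly on the likelihood function.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
# Illustrative data: 30 values near 390, 70 values near 454.
data = np.concatenate([rng.normal(390, 20, 30), rng.normal(454, 13, 70)])

def neg_log_likelihood(theta):
    # theta = (mixing proportion, mean 1, sd 1, mean 2, sd 2)
    p, mu1, s1, mu2, s2 = theta
    if not (0 < p < 1) or s1 <= 0 or s2 <= 0:
        return np.inf                      # keep the search in bounds
    dens = p * norm.pdf(data, mu1, s1) + (1 - p) * norm.pdf(data, mu2, s2)
    return -np.sum(np.log(dens))

start = [0.5, 380.0, 15.0, 450.0, 15.0]    # rough guesses from a data plot
fit = minimize(neg_log_likelihood, start, method="Nelder-Mead")
p, mu1, s1, mu2, s2 = fit.x                # fitted parameter values
```

Minimizing the negative log-likelihood is equivalent to maximizing the likelihood itself; the simplex search needs no derivatives, which is why it suits mixture likelihoods.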
The computations outlined above are performed for a sequence of successively more complex statistical models of mixtures of normal distributions, starting with a single normal distribution, then a mixture of two normal distributions with the same mean, then two normal distributions with different means, and so on.
The choice of the most complicated model to try is made from the cumulative distribution plot, such as shown in Figure 1.

Figure 2. Mathematical Expression of a Mixture of Normal Distributions:

f(x) = p1*N(u1,s1) + p2*N(u2,s2) + (1 - p1 - p2)*N(u3,s3),

where N(u,s) denotes the normal density (1/(s*sqrt(2*pi))) * exp(-(1/2)*((x - u)/s)^2).
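The Figure 2 expression can be transcribed directly into code, assuming NumPy; the parameter names are the figure's, and any numerical values supplied are placeholders:

```python
# Direct transcription of the Figure 2 mixture density.
import numpy as np

def normal_pdf(x, mu, sigma):
    # N(mu, sigma): the normal density function.
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def mixture_pdf(x, p1, p2, mu1, s1, mu2, s2, mu3, s3):
    # f(x) = p1*N(mu1,s1) + p2*N(mu2,s2) + (1 - p1 - p2)*N(mu3,s3)
    return (p1 * normal_pdf(x, mu1, s1)
            + p2 * normal_pdf(x, mu2, s2)
            + (1 - p1 - p2) * normal_pdf(x, mu3, s3))
```

Because the mixing proportions sum to one, the mixture integrates to one like any density.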
In our data a maximum of four distributions was
used.
These models were compared using the likelihood ratio test; the details of this test are also found in elementary mathematical statistics texts. Note that, with this test, we cannot decide if any one model is a good fit to the data or not; we can only decide the relative merits of the models used. The successively more complex models were pairwise compared using the likelihood ratio principle until no statistically significant improvement in the fit to the data was found.
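The pairwise comparison just described can be sketched as a standard likelihood-ratio test, assuming SciPy; the log-likelihood values below are made up for illustration, and in practice mixture models can violate the test's usual regularity conditions, so the chi-square reference distribution is an approximation:

```python
# Sketch of the pairwise likelihood-ratio comparison of nested models.
from scipy.stats import chi2

def lr_test(loglik_simple, loglik_complex, extra_params):
    """P-value for rejecting the simpler model in favor of the complex one."""
    stat = 2.0 * (loglik_complex - loglik_simple)   # the LR statistic
    return chi2.sf(stat, df=extra_params)           # chi-square tail area

# Illustrative numbers: a log-likelihood gain of 6 over 2 extra parameters.
p_value = lr_test(-200.0, -194.0, 2)
# A small p-value favors the more complex mixture; a large one means the
# extra components gave no statistically significant improvement.
```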
A model consisting of a mixture of three normal distributions was found
to best represent our example data; these distributions are plotted as the black curves in Figure 3. The broken line curve was calculated from the
sample mean and standard deviation of all the data assuming a single
normal distribution. The relative size of the black curves, the area
under these curves, is drawn here to represent the proportions of the
data in each of the component distributions.
The broken line curve is simply drawn at a convenient size. Specifically, this data set is best
modeled by 21 percent of the data having a mean of
390 and a standard
deviation of 195 (the wide curve at the bottom), 22 percent of the data
having a mean of 394 and a standard deviation of 11 (the lower middle
curve),
and
57 percent of the data having a mean of 454 and a standard
deviation of 13 (the tallest curve).
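As a quick check (a sketch, using the sample size of 37 reported below), the fitted mixing proportions translate into approximate laboratory counts; the group labels are informal names for the three curves, not the paper's terminology:

```python
# Converting the fitted mixing proportions into approximate counts
# of laboratories, out of the n = 37 reported results.
n = 37
proportions = {"wide": 0.21, "lower middle": 0.22, "tallest": 0.57}
counts = {name: round(p * n) for name, p in proportions.items()}
# counts -> {'wide': 8, 'lower middle': 8, 'tallest': 21}
```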
Some additional information is of interest. The total sample size was 37. The "known value" of the material sent to the laboratories was 452, essentially the same as the 454 mean of the tallest curve; thus we suspect that this component curve represents the "good" laboratories.
The average of all the data,
represented by the broken line curve,
was
440, which, if used to characterize the data set, suggests a bias from
the known value--a bias
that disappears
if we use the component curves.
Consider the range of concentrations defining the 95-percent area of
the "good” or tallest curve.
Eleven percent of the wide distribution
is within this good range.
Since the wide group represents 21 percent
of the data, about 2 percent (21%*11%) of the data values would be misclassified as good when in fact the values are from poorly performing
laboratories that just happened to hit
it right this time.
Less than one percent of the lower middle curve is actually within this 95-percent interval of the tallest curve. One could, of course, repeat this exercise with any acceptance criteria one wished.
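The overlap arithmetic above can be reproduced from the fitted parameters, assuming SciPy; small differences from the quoted percentages reflect rounding in the reported means and standard deviations:

```python
# The 95-percent interval of the tallest "good" curve, N(454, 13), and the
# fraction of each of the other component curves falling inside it.
from scipy.stats import norm

lo, hi = norm.interval(0.95, loc=454, scale=13)   # roughly (428.5, 479.5)

def fraction_inside(mu, sigma):
    # Probability that N(mu, sigma) lands inside the acceptance interval.
    return norm.cdf(hi, mu, sigma) - norm.cdf(lo, mu, sigma)

wide_inside = fraction_inside(390, 195)    # close to the ~11 percent quoted
middle_inside = fraction_inside(394, 11)   # well under one percent
misclassified = 0.21 * wide_inside         # about 2 percent of all the data
```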
We suspect the lower middle group represents a group of eight laboratories with good