The theory
of probability binning and frequency difference gating will be discussed on the
basis of three articles that describe the probability binning algorithm in
univariate analyses, the extension into a multivariate space, and application
to complex data sets, but will not address the mathematical arguments that the
theory is based on.
Probability Binning Comparison: A Metric for Quantitating
Univariate Distribution Differences (Cytometry 45:37–46, 2001)
Mario
Roederer, Adam Treister, Wayne Moore, Leonore A. Herzenberg
Background: Comparing distributions of data is an
important goal in many applications. For example, determining whether two
samples (e.g., a control and test sample) are statistically significantly
different is useful to detect a response, or to provide feedback regarding
instrument stability by detecting when collected data varies signifi-cantly
over time.
Methods: We apply a variant of the
chi-squared statistic to comparing univariate distributions. In this variant, a
con-trol distribution is divided such that an equal number of events fall into
each of the divisions, or bins. This ap-proach is thereby a mini-max algorithm,
in that it mini-mizes the maximum expected variance for the control
distribution. The control-derived bins are then
applied to test sample distributions, and a normalized chi-squared
value is computed. We term this
algorithm Probability Binning.
Results: Using a Monte-Carlo simulation, we
determined the distribution of chi-squared values obtained by compar-ing sets
of events derived from the same distribution. Based on this distribution, we
derive a conversion of any given chi-squared value into a metric that is
analogous to a t-score, i.e., it can be used to estimate the probability that a
test distribution is different from a control distribu-tion. We demonstrate
that this metric scales with the difference between two distributions, and can
be used to rank samples according to similarity to a control. Finally, we
demonstrate the applicability of this metric to ranking immunophenotyping
distributions to suggest that it in-deed can be used to objectively determine
the relative distance of distributions compared to a single control.
Conclusion: Probability Binning, as shown here,
pro-vides a useful metric for determining the probability that
two or more flow cytometric data distributions
are different This metric can also be used to rank distributions to
identify which are most similar or
dissimilar. In addition, the algorithm can be used to quantitate contamination
of even highly-overlapping populations.
Probability
Binning Comparison: A Metric for Quantitating Multivariate Distribution Differences (Cytometry 45:47–55, 2001)
Mario Roederer, Wayne Moore, Adam
Treister, Richard R. Hardy, Leonore A. Herzenberg
Background: While several algorithms for the
compar-ison of univariate distributions arising from flow cyto-metric analyses
have been developed and studied for many years, algorithms for comparing
multivariate dis-tributions remain elusive. Such algorithms could be useful for
comparing differences between samples
based on several
independent measurements, rather than differences based on any single
measurement. It
is
conceivable that distributions
could be completely dis-tinct in multivariate space, but unresolvable in any
combination of
univariate histograms. Multivariate com-parisons could also be useful for providing feedback
about instrument stability, when
only subtle changes in measurements are occurring.
Methods: We apply a variant of Probability
Binning, de-scribed in the accompanying article, to multidimensional data. In
this approach, hyper-rectangles of n dimensions (where n is the
number of measurements being com-pared) comprise the bins used for the
chi-squared statistic. These hyper-dimensional bins are constructed such that
the control sample has the same number of events in each bin; the bins are then
applied to the test samples for chi-squared calculations.
Results: Using a Monte-Carlo simulation, we
determined the distribution of chi-squared values obtained by compar-ing sets
of events from the same distribution; this distri-bution of chi-squared values
was identical as for the uni-variate algorithm. Hence, the same formulae can be
used to construct a metric, analogous to a t-score, that
estimates the probability with which distributions are distinct. As for
univariate comparisons, this metric scales with the difference between two
distributions, and can be used to rank samples according to similarity to a
control. We apply the algorithm to multivariate immunophenotyping data, and
demonstrate that it can be used to discriminate distinct samples and to rank
samples according to a bio-logically- meaningful difference.
Conclusion: Probability binning, as shown here,
pro-vides a useful metric for determining the probability with
which two or more multivariate distributions
represent distinct sets of data. The metric can be
used to identify the
similarity or
dissimilarity of samples. Finally, as demon-strated in the accompanying paper, the
algorithm can be
used to gate on events in one
sample that are different from a control sample, even
if those events cannot be
distinguished on the basis of any
combination of univari-ate or bivariate displays.
Frequency
Difference Gating: A Multivariate Method for Identifying Subsets That Differ
Between Samples (Cytometry 45: 56–64, 2001)
Mario Roederer and Richard R. Hardy
Background: In multivariate distributions (for
example, in 3- or more color flow cytometric datasets), it can become difficult
or impossible to identify populations that differ be-tween samples based only
on a combination of univariate or bivariate displays. Indeed, it is possible
that such differences can only be identified in “n”-dimensional space, where
“n” is the number of parameters measured. Therefore, computer assisted
identification of such differences is necessary. Such a method could be used to
identify responses (i.e., by com-paring cell samples before and after
stimulation) in exquisite detail by allowing complete analysis of the collected
data on
only those events
which have responded.
Methods: Multivariate Probability Binning
can be used to compare different datasets to identify the distance and
sta-tistical significance of a difference between the distributions. An
intermediate step in the algorithm provides access to the actual locations within
the n-dimensional comparison which are most different between the
distributions. Gates based on collections of hyper-rectangular bins can then be
applied to datasets, thereby selecting those events (or clusters of events)
that are different between samples. We term this process Frequency Difference
Gating.
Results. Frequency Difference Gating was used in several test
scenarios to evaluate its utility. First, we compared PBMC subsets identified by
solely by immunofluorescence staining: based on this training data set, the
algorithm automatically generated an accurate forward and side-scatter gate to
identify lymphocytes. Second, we applied the algorithm to identify subtle
differences between CD4 memory subsets based on 8-color immunophenotyping data.
The resulting 3-dimensional gate could resolve cells subsets much more frequent
in one subset compared to the other; no combination of two-dimensional gates
could accomplish this resolution. Finally, we used the algorithm to compare B
cell populations derived from mice of dif-ferent ages or strains, and found
that the algorithm could find very subtle differences between the populations.
Conclusion. Frequency Difference Gating is a powerful tool that
automates the process of identifying events com-prising underlying differences
between samples. It is not a clustering tool; it is not meant to identify
subsets in multidimensional space. Importantly, this method may reveal subtle
changes in small populations of cells,
changes that only occur
simultaneously in multiple dimen-sions in such a way that identification by
univariate or
bivariate
analyses is
impossible. Finally, the method may significantly aid in the analysis of
high-order multivariate data (i.e., 6-12 color flow cytometric analyses), where
identification of differences between datasets becomes so
time-consuming as to be impractical.