Similarity and diversity measures for flow cytometry data
Open Access
Ahmed, Hasan Rasheed (2011)
Abstract
This paper examines similarity measures (also known as dissimilarity or statistical distance measures) for flow cytometry data. Similarity measures quantify the similarity between two objects and could be used for clustering or neighborhood-based predictive modeling. I find that earth mover's distance is the most appropriate tool for creating similarity measures for flow cytometry data. I compare this approach to earlier approaches that relied on Kullback-Leibler divergence, Pearson correlation or Lp distance. This paper also examines diversity measures for flow cytometry data. It identifies two types of diversity measures, "nominal diversity measures" and "interval diversity measures", and it explains the connection between diversity measures and similarity measures.
The similarity and diversity measures in this paper were designed for flow cytometry data, but they can be used for any dataset that consists of multisets of equal-dimensional points. This paper includes R code for implementing many of the methods discussed. The datasets used in this paper have been made publicly available.
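As a minimal illustration of the earth mover's distance idea (not the thesis's R code), the sketch below handles only the simplest case: two one-dimensional samples, each treated as a multiset of points with equal weight per point. In one dimension the distance equals the area between the two empirical cumulative distribution functions. The function name `emd_1d` and the toy values are hypothetical; flow cytometry samples are multi-dimensional and would need a general transportation solver.

```python
def emd_1d(a, b):
    """Earth mover's distance between two 1-D samples (illustrative sketch).

    Each sample is treated as an empirical distribution that puts equal
    mass on every point; the sample sizes may differ.
    """
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(a), sorted(b)
    points = sorted(set(a) | set(b))  # every location where a CDF can jump
    total = 0.0
    i = j = 0  # number of points of a (resp. b) that are <= current location
    for k in range(len(points) - 1):
        x = points[k]
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        # Area between the two empirical CDFs over [x, next location)
        total += abs(i / len(a) - j / len(b)) * (points[k + 1] - x)
    return total


# Toy samples (hypothetical values, not taken from the thesis datasets)
sample_x = [0.0, 1.0]
sample_y = [1.0, 2.0]
distance = emd_1d(sample_x, sample_y)
```

Here `sample_y` is `sample_x` shifted by one unit, so every unit of mass has to travel a distance of one.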
Table of Contents
I. Introduction
I.A. Flow cytometry data
I.B. Similarity and diversity measures
I.C. Previous work
I.D. Assay variation and Boolean gate data
II. Similarity measures
II.A. Overview of similarity measures
II.B. Moment methods
II.B.1. First moment method
II.B.2. Second moment method
II.C. Probability density function methods
II.D. Probability mass function methods
II.D.1. Equal-sized bins method
II.D.2. K-means clustering method
II.D.3. Sum of squares method
II.D.4. Trees: a supervised method
II.D.5. Bumped trees
II.E. Cumulative distribution function method
II.F. Pure earth mover's distance
III. Comparison of similarity measures
III.A. Desirable properties for similarity measures
III.B. Other considerations
III.C. Testing accuracy and consistency using dataset 1
III.D. Testing resistance to assay variation using dataset 2
III.E. Conclusions
IIII. Diversity measures
IIII.A. Two types of diversity measures
IIII.B. Nominal diversity measures
IIII.C. Interval diversity measures
IIII.D. Assessing diversity measures using dataset 1
IIII.E. Conclusions
V. Endnotes
VI. R Code
Table 1: Various similarity measures
Figure 1: Assay variation between two samples
Table 2: Example of raw data
Table 3: Example of Boolean gate data
Table 4: Example of Boolean gate data in compact form
Table 5: Comparison of similarity measures based on properties
Table 6: Comparison of similarity measures based on the correlation between log viral load and nearest-neighbor log viral load
Table 7: Mean correlation between a similarity measure and all other similarity measure implementations
Table 8: Mean correlation within a family of similarity measures
Table 9: Similarity matrix produced by first moment method and L1 distance using raw data
Table 10: Similarity matrix produced by first moment method and L1 distance using Boolean gate data
Table 11: Assay variation scores
Table 12: Comparison of diversity measures
About this Master's Thesis
Primary PDF
Title | Date Uploaded
---|---
Similarity and diversity measures for flow cytometry data | 2018-08-28 12:51:31 -0400
Supplemental Files
Title | Date Uploaded
---|---
dataset1_Boolean.csv | 2018-08-28 12:54:37 -0400
dataset2_raw.csv | 2018-08-28 12:56:09 -0400
dataset1_raw.csv | 2018-08-28 12:57:36 -0400
SupplementaryTables.xls | 2018-08-28 12:59:06 -0400
dataset2_Boolean.csv | 2018-08-28 13:00:31 -0400