Similarity and diversity measures for flow cytometry data Open Access

Ahmed, Hasan Rasheed (2011)

Permanent URL: https://etd.library.emory.edu/concern/etds/2b88qc555?locale=en
Published

Abstract

This paper examines similarity measures (also known as dissimilarity or statistical distance
measures) for flow cytometry data. Similarity measures quantify the similarity between two
objects and could be used for clustering or neighborhood-based predictive modeling. I find that
earth mover's distance is the most appropriate tool for creating similarity measures for flow
cytometry data. I compare this approach to earlier approaches that relied on Kullback-Leibler
divergence, Pearson correlation or Lp distance. This paper also examines diversity measures for
flow cytometry data. It identifies two types of diversity measures, "nominal diversity measures"
and "interval diversity measures", and it explains the connection between diversity measures
and similarity measures.
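
As a rough illustration of the earth mover's distance idea mentioned above, the sketch below computes it for two one-dimensional samples of equal size. This is not the thesis's R code: the function name emd_1d and the sample values are hypothetical, and restricting to one dimension with equal sample sizes is a simplifying assumption (flow cytometry samples are multisets of multi-dimensional points).

```python
def emd_1d(sample_a, sample_b):
    """Earth mover's distance between two equal-sized 1-D samples.

    With equal sample sizes and uniform weights, an optimal transport
    plan pairs the i-th smallest value of one sample with the i-th
    smallest value of the other, so the distance is the mean absolute
    difference between the sorted samples.
    """
    if len(sample_a) != len(sample_b) or not sample_a:
        raise ValueError("samples must be non-empty and of equal size")
    pairs = zip(sorted(sample_a), sorted(sample_b))
    return sum(abs(x - y) for x, y in pairs) / len(sample_a)


# Hypothetical intensity values for three samples (made-up numbers).
sample_1 = [1.0, 2.0, 3.0, 4.0]
sample_2 = [1.5, 2.5, 3.5, 4.5]      # sample_1 shifted by 0.5
sample_3 = [11.0, 12.0, 13.0, 14.0]  # sample_1 shifted by 10

near = emd_1d(sample_1, sample_2)
far = emd_1d(sample_1, sample_3)
```

In this toy case the distance grows with the size of the shift between samples, which is the kind of behavior one would want from a measure intended to tolerate small displacements of a cell population.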


The similarity and diversity measures in this paper were designed for flow cytometry data, but
they can be used for any dataset that consists of multisets of equal-dimensional points. This
paper includes R code for implementing many of the methods discussed. The datasets used in
this paper have been made publicly available.

Table of Contents


I. Introduction......................................................1
I.A. Flow cytometry data............................................1
I.B. Similarity and diversity measures..............................2
I.C. Previous work..................................................4
I.D. Assay variation and Boolean gate data..........................6
II. Similarity measures..............................................9
II.A. Overview of similarity measures...............................9
II.B. Moment methods...............................................10
II.B.1. First moment method.......................................10
II.B.2. Second moment method......................................10
II.C. Probability density function methods.........................11
II.D. Probability mass function methods............................12
II.D.1. Equal-sized bins method...................................12
II.D.2. K-means clustering method.................................13
II.D.3. Sum of squares method.....................................13
II.D.4. Trees: a supervised method................................14
II.D.5. Bumped trees..............................................15
II.E. Cumulative distribution function method......................16
II.F. Pure earth mover's distance..................................16
III. Comparison of similarity measures..............................17
III.A. Desirable properties for similarity measures................17
III.B. Other considerations........................................20
III.C. Testing accuracy and consistency using dataset 1............21
III.D. Testing resistance to assay variation using dataset 2.......27
III.E. Conclusions.................................................31
IV. Diversity measures..............................................34
IV.A. Two types of diversity measures..............................34
IV.B. Nominal diversity measures...................................35
IV.C. Interval diversity measures..................................36
IV.D. Assessing diversity measures using dataset 1.................37
IV.E. Conclusions..................................................39
V. Endnotes.........................................................40
VI. R Code..........................................................42

Table 1: Various similarity measures.................................2
Figure 1: Assay variation between two samples........................6
Table 2: Example of raw data.........................................8
Table 3: Example of Boolean gate data................................8
Table 4: Example of Boolean gate data in compact form................8
Table 5: Comparison of similarity measures based on properties......19
Table 6: Comparison of similarity measures based on the correlation
between log viral load and nearest-neighbor log viral load..........22
Table 7: Mean correlation between a similarity measure and all other
similarity measure implementations..................................25
Table 8: Mean correlation within a family of similarity measures....27
Table 9: Similarity matrix produced by first moment method and L1
distance using raw data.............................................29
Table 10: Similarity matrix produced by first moment method and L1
distance using Boolean gate data....................................29
Table 11: Assay variation scores....................................30
Table 12: Comparison of diversity measures..........................37

About this Master's Thesis

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
Language
  • English
