Similarity and diversity measures for flow cytometry data (Public)

Ahmed, Hasan Rasheed (2011)

Permanent URL: https://etd.library.emory.edu/concern/etds/2b88qc555?locale=es

Published

Abstract

By Hasan Ahmed

This paper examines similarity measures (also known as dissimilarity or statistical distance
measures) for flow cytometry data. Similarity measures quantify the similarity between two
objects and could be used for clustering or neighborhood-based predictive modeling. I find that
earth mover's distance is the most appropriate tool for creating similarity measures for flow
cytometry data. I compare this approach to earlier approaches that relied on Kullback-Leibler
divergence, Pearson correlation or Lp distance. This paper also examines diversity measures for
flow cytometry data. It identifies two types of diversity measures, "nominal diversity measures"
and "interval diversity measures", and it explains the connection between diversity measures
and similarity measures.

The similarity and diversity measures in this paper were designed for flow cytometry data, but
they can be used for any dataset that consists of multisets of equal-dimensional points. This
paper includes R code for implementing many of the methods discussed. The datasets used in
this paper have been made publicly available.
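
As a rough illustration of why an earth mover's distance can separate samples that a simpler summary cannot, here is a minimal Python sketch. It is not the thesis's R code. It assumes one-dimensional samples of equal size with uniform weights, in which case the earth mover's distance reduces to the mean absolute difference between the sorted values. The function names `first_moment_distance` and `emd_1d` and the sample values are hypothetical, chosen only for this example.

```python
from statistics import mean


def first_moment_distance(a, b):
    """Absolute difference between the two sample means."""
    return abs(mean(a) - mean(b))


def emd_1d(a, b):
    """Earth mover's distance between two equal-sized 1-D samples.

    Assumption for this sketch: every point carries the same weight,
    so the optimal transport plan pairs the i-th smallest value of `a`
    with the i-th smallest value of `b`.
    """
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal-sized samples")
    return mean(abs(x - y) for x, y in zip(sorted(a), sorted(b)))


# Two made-up samples with the same mean but different spread.
sample_a = [0, 0, 10, 10]
sample_b = [5, 5, 5, 5]

print(first_moment_distance(sample_a, sample_b))
print(emd_1d(sample_a, sample_b))
```

For these two samples the means coincide, so the mean-based distance is zero, while the earth mover's distance is 5 because every point has to move 5 units. Real flow cytometry samples are multidimensional and of unequal size, which requires a general optimal transport solver rather than this sorted-pairs shortcut.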

Table of Contents

I. Introduction
I.A. Flow cytometry data
I.B. Similarity and diversity measures
I.C. Previous work
I.D. Assay variation and Boolean gate data
II. Similarity measures
II.A. Overview of similarity measures
II.B. Moment methods
II.B.1. First moment method
II.B.2. Second moment method
II.C. Probability density function methods
II.D. Probability mass function methods
II.D.1. Equal-sized bins method
II.D.2. K-means clustering method
II.D.3. Sum of squares method
II.D.4. Trees: a supervised method
II.D.5. Bumped trees
II.E. Cumulative distribution function method
II.F. Pure earth mover's distance
III. Comparison of similarity measures
III.A. Desirable properties for similarity measures
III.B. Other considerations
III.C. Testing accuracy and consistency using dataset 1
III.D. Testing resistance to assay variation using dataset 2
III.E. Conclusions
IIII. Diversity measures
IIII.A. Two types of diversity measures
IIII.B. Nominal diversity measures
IIII.C. Interval diversity measures
IIII.D. Assessing diversity measures using dataset 1
IIII.E. Conclusions
V. Endnotes
VI. R Code

Table 1: Various similarity measures
Figure 1: Assay variation between two samples
Table 2: Example of raw data
Table 3: Example of Boolean gate data
Table 4: Example of Boolean gate data in compact form
Table 5: Comparison of similarity measures based on properties
Table 6: Comparison of similarity measures based on the correlation between log viral load and nearest-neighbor log viral load
Table 7: Mean correlation between a similarity measure and all other similarity measure implementations
Table 8: Mean correlation within a family of similarity measures
Table 9: Similarity matrix produced by first moment method and L1 distance using raw data
Table 10: Similarity matrix produced by first moment method and L1 distance using Boolean gate data
Table 11: Assay variation scores
Table 12: Comparison of diversity measures

About this Master's Thesis

Rights statement

Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.

School	Rollins School of Public Health
Department	Biostatistics and Bioinformatics
Subfield / Discipline	Biostatistics - MPH & MSPH
Degree	MPH
Submission	Master's Thesis
Language	English
Research Field	Biology, Biostatistics; Statistics
Keyword	similarity; distance; diversity; flow cytometry; statistics; dissimilarity
Committee Chair / Thesis Advisor	Hertzberg, Vicki S, Emory University
Committee Members	Elon, Lisa, Emory University
Partnering Agencies	Emory University schools, faculty or affiliated programs

Last modified

Primary PDF

Title	Date Uploaded
Similarity and diversity measures for flow cytometry data	2018-08-28 12:51:31 -0400

Supplemental Files

Title	Date Uploaded
dataset1_Boolean.csv	2018-08-28 12:54:37 -0400
dataset2_raw.csv	2018-08-28 12:56:09 -0400
dataset1_raw.csv	2018-08-28 12:57:36 -0400
SupplementaryTables.xls	2018-08-28 12:59:06 -0400
dataset2_Boolean.csv	2018-08-28 13:00:31 -0400