Abstract
The potential for research involving biospecimens can be
hindered by the high cost of laboratory assays. To reduce cost,
strategies such as randomly selecting a portion of specimens for
analysis or randomly pooling specimens prior to performing
laboratory assays may be employed, yet are often accompanied by a
considerable loss of statistical efficiency. Intuitively, forming
pools from specimens with similar covariate values will help
maintain high precision levels among regression coefficient
estimates by preserving the relationship between the outcome and
predictor variables. To implement this strategy, we propose a novel
pooling method based on the k-means clustering algorithm. We first
test this method in a linear regression setting, which provides a
convenient avenue for assessing potential efficiency gains from
k-means pooling, and then apply it in subsequent studies to promote
efficiency. Many biomarkers measured in epidemiological
studies, however, exhibit a positive, right-skewed distribution,
for which linear regression may not be appropriate. Regression
models suitable for this type of outcome data are explored,
including a modification of multiple linear regression on a
log-transformation of pool-wise data and a novel parameterization
of the gamma distribution. If pools are formed from specimens with
identical covariate values, regression analyses on a right-skewed,
pooled outcome are greatly simplified. When these x-homogeneous
pools cannot be formed, we propose a quasi-likelihood model for
pooled specimens as well as a Monte Carlo Expectation Maximization
(MCEM) algorithm. We then develop an extension of Akaike's
Information Criterion to help select the best model. Simulations
demonstrate that these analytical methods provide essentially
unbiased estimates of coefficient parameters as well as their
standard errors when appropriate assumptions are met. In
conclusion, when the number of laboratory tests is limited by
budget, pooling specimens prior to performing lab assays can be an
effective way to save money. High levels of precision can be
maintained by exploiting covariate information to form pools, as in
k-means pooling, then selecting the best-fitting model using an
AIC-type criterion. When pools are formed strategically and
analyzed under the appropriate models, pooling can considerably
reduce costs with minimal information loss.
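The k-means pooling idea described above can be illustrated with a minimal sketch. It is not the dissertation's own code (the appendix lists R and SAS examples, which are not reproduced here). It assumes equal aliquot volumes, so that the assay on a pool behaves like the average of its members' outcomes, and it fits the pool-level linear model by weighted least squares with weights equal to pool size. The function names kmeans_pools and fit_pooled, the simulated data, and the choice of 600 specimens in 100 pools are all hypothetical choices made for illustration.

```python
import numpy as np

def kmeans_pools(X, n_pools, n_iter=50, seed=0):
    """Assign specimens to pools with plain Lloyd's k-means on covariates X (n x p)."""
    rng = np.random.default_rng(seed)
    # Start from n_pools distinct specimens chosen at random.
    centers = X[rng.choice(len(X), size=n_pools, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every specimen to every center, shape (n, n_pools).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for k in range(n_pools):
            members = X[labels == k]
            if len(members) > 0:  # leave an empty cluster's center unchanged
                centers[k] = members.mean(axis=0)
    return labels

def fit_pooled(X, y, labels):
    """Weighted least squares on pool means (assumes equal aliquot volumes)."""
    pools = np.unique(labels)
    sizes = np.array([(labels == k).sum() for k in pools])
    x_bar = np.array([X[labels == k].mean(axis=0) for k in pools])
    # The pool mean of y stands in for the single assay run on each pool.
    y_bar = np.array([y[labels == k].mean() for k in pools])
    design = np.column_stack([np.ones(len(pools)), x_bar])
    # Var(pool mean) = sigma^2 / pool size, so scale each row by sqrt(size).
    w = np.sqrt(sizes)
    beta, *_ = np.linalg.lstsq(design * w[:, None], y_bar * w, rcond=None)
    return beta

# Simulated data (hypothetical): y = 1 + 2x + noise, 600 specimens, 100 assays.
rng = np.random.default_rng(1)
n, n_pools = 600, 100
X = rng.normal(size=(n, 1))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=n)

labels_km = kmeans_pools(X, n_pools)
labels_random = rng.permutation(n) % n_pools  # random pooling for comparison

beta_kmeans = fit_pooled(X, y, labels_km)
beta_random = fit_pooled(X, y, labels_random)
print("k-means pooling (intercept, slope):", beta_kmeans)
print("random pooling  (intercept, slope):", beta_random)
```

Repeating the simulation over many seeds and comparing the spread of the two slope estimates is one way to see the precision difference the abstract describes. A single run only shows that both designs recover coefficients near the simulated values.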
Table of Contents
1 Background
1.1 Origins and Applications of Pooling
1.2 Efficient Pooling Designs
1.2.1 Clustering
1.3 Pooling in Logistic Regression
1.3.1 When a Binary Outcome is Pooled
1.3.2 When an Exposure is Pooled
1.4 Pooling on a Right-Skewed, Continuous Variable
1.4.1 Convolution
1.4.2 Approximating the Density of a Sum of Random Variables
1.4.3 Monte Carlo Expectation Maximization (MCEM) Algorithm
1.5 Quasi-Likelihood
2 Linear Regression on a Pooled Outcome
2.1 Introduction
2.2 Regression Formulation
2.2.1 Equal Aliquot Volumes
2.2.2 Unequal Aliquot Volumes
2.3 Pooling and Selection Methods
2.3.1 "Smart" Selection
2.3.2 "Smart" Pooling
2.3.3 k-means Clustering
2.4 Simulation Study
2.4.1 SLR: Equal Aliquots
2.4.2 SLR: Unequal Aliquots
2.4.3 Multiple Linear Regression
2.5 Data Analysis
2.6 Applying k-means to Logistic Regression
2.7 Discussion
3 Lognormal Regression Models for a Skewed, Pooled Outcome
3.1 Introduction
3.2 A Motivating Example: Cytokines in the CPP
3.3 Regression Model for Individual Subjects
3.4 Naive Model for Pooled Data
3.5 Approximate Model for Pooled Data
3.6 Calculating MLEs
3.7 MLEs via MCEM
3.7.1 E step
3.7.2 Monte Carlo Estimation
3.7.3 M step
3.7.4 Standard Error Estimation
3.7.5 Example: Lognormal Distribution
3.8 Pooling Methods
3.8.1 x-homogeneous Pools
3.8.2 k-means Clustering
3.9 Simulation Study
3.9.1 Comparing Analytical Strategies
3.9.2 Comparing Pooling Strategies
3.9.3 Convolution Method
3.9.4 A Cautionary Tale
3.10 Data Analysis
3.11 Discussion
4 Comparing Parametric and Semi-Parametric Models for a Skewed, Pooled Outcome
4.1 Introduction
4.2 Parametric Regression Models for Skewed Outcomes
4.2.1 Lognormal Model
4.2.2 Gamma1 Model
4.2.3 Gamma2 Model
4.3 Semi-parametric Regression Models for Skewed Data
4.3.1 Approximate Model Revisited
4.3.2 Quasi-Likelihood Models
4.4 Model Selection
4.5 Simulation Study
4.5.1 Lognormal
4.5.2 Gamma1
4.5.3 Gamma2
4.5.4 Precision
4.5.5 Naive QL Models
4.6 Data Analysis
4.7 Discussion
5 Summary and Future Research
A R and SAS Code Examples
A.1 k-means Clustering
A.2 gamma2 Model
A.3 QL Model under heterogeneous pools
About this Dissertation
Rights statement
- Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.