Regression Models for a Continuous Outcome Subject to Pooling Open Access

Mitchell, Emily (2013)

Permanent URL: https://etd.library.emory.edu/concern/etds/4x51hj198?locale=en

Published

Abstract

The potential for research involving biospecimens can be hindered by the high cost of laboratory assays. To reduce cost, strategies such as randomly selecting a portion of specimens for analysis or randomly pooling specimens prior to performing laboratory assays may be employed, yet are often accompanied by a considerable loss of statistical efficiency. Intuitively, forming pools from specimens with similar covariate values will help maintain high precision levels among regression coefficient estimates by preserving the relationship between the outcome and predictor variables. To implement this strategy, we propose a novel pooling method based on the k-means clustering algorithm. This method is tested in a linear regression setting, then applied in subsequent studies to promote efficiency. Linear regression provides a convenient avenue to test potential efficiency gains from k-means pooling. Many biomarkers measured in epidemiological studies, however, exhibit a positive, right-skewed distribution, for which linear regression may not be appropriate. Regression models suitable to this type of outcome data are explored, including a modification of multiple linear regression on a log-transformation of pool-wise data and a novel parameterization of the gamma distribution. If pools are formed from specimens with identical covariate values, regression analyses on a right-skewed, pooled outcome are greatly simplified. When these x-homogeneous pools cannot be formed, we propose a quasi-likelihood model for pooled specimens as well as a Monte Carlo Expectation Maximization (MCEM) algorithm. We then develop an extension of Akaike's Information Criterion to help select the best model. Simulations demonstrate that these analytical methods provide essentially unbiased estimates of coefficient parameters as well as their standard errors when appropriate assumptions are met. 
In conclusion, when the number of laboratory tests is limited by budget, pooling specimens prior to performing lab assays can be an effective way to save money. High levels of precision can be maintained by exploiting covariate information to form pools, as in k-means pooling, then selecting the best-fitting model using an AIC-type criterion. When pools are formed strategically and analyzed under the appropriate models, pooling can considerably reduce costs with minimal information loss.
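
The covariate-based pooling idea described in the abstract can be illustrated with a small sketch. The Python code below is not taken from the dissertation, whose own examples are listed in Appendix A as R and SAS code. It assumes a single covariate, equal aliquot volumes (so a pool's assay value is taken to be the mean of its members' outcomes), and a weighted least-squares fit with pool sizes as weights. The function names `kmeans_1d`, `pool_specimens`, and `fit_pooled_line` and the simulated data are hypothetical and chosen only for illustration.

```python
import random


def kmeans_1d(x, k, iters=50):
    """Plain Lloyd's algorithm on one covariate; returns a cluster label per specimen."""
    xs = sorted(x)
    n = len(xs)
    # Deterministic start (an assumption of this sketch): evenly spaced order statistics.
    centers = [xs[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    labels = [0] * n
    for _ in range(iters):
        # Assign each specimen to its nearest center.
        labels = [min(range(k), key=lambda j: (xi - centers[j]) ** 2) for xi in x]
        new_centers = []
        for j in range(k):
            members = [xi for xi, lab in zip(x, labels) if lab == j]
            # An empty cluster keeps its previous center.
            new_centers.append(sum(members) / len(members) if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def pool_specimens(x, y, labels):
    """Form one pool per non-empty cluster; return (mean x, mean y, pool size) tuples."""
    groups = {}
    for xi, yi, lab in zip(x, y, labels):
        groups.setdefault(lab, []).append((xi, yi))
    pools = []
    for members in groups.values():
        size = len(members)
        x_bar = sum(m[0] for m in members) / size
        y_bar = sum(m[1] for m in members) / size  # assumed pooled assay value
        pools.append((x_bar, y_bar, size))
    return pools


def fit_pooled_line(pools):
    """Weighted least squares of pooled y on pooled x, with weights equal to pool size."""
    total = sum(size for _, _, size in pools)
    x_mean = sum(size * xb for xb, _, size in pools) / total
    y_mean = sum(size * yb for _, yb, size in pools) / total
    sxy = sum(size * (xb - x_mean) * (yb - y_mean) for xb, yb, size in pools)
    sxx = sum(size * (xb - x_mean) ** 2 for xb, _, size in pools)
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


# Simulated example: 200 specimens reduced to at most 20 assays.
random.seed(0)
x = [random.gauss(0, 1) for _ in range(200)]
y = [1 + 2 * xi + random.gauss(0, 0.5) for xi in x]
pools = pool_specimens(x, y, kmeans_1d(x, 20))
intercept, slope = fit_pooled_line(pools)
print(len(pools), round(intercept, 3), round(slope, 3))
```

Weighting by pool size reflects the fact that, under this sketch's assumptions, the variance of a pool mean shrinks in proportion to the number of specimens in the pool. Because pools are formed from specimens with similar covariate values, the between-pool spread in x stays close to the spread in the original sample, which is the intuition the abstract gives for preserved precision.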

Contents
1 Background
1.1 Origins and Applications of Pooling
1.2 Efficient Pooling Designs
1.2.1 Clustering
1.3 Pooling in Logistic Regression
1.3.1 When a Binary Outcome is Pooled
1.3.2 When an Exposure is Pooled
1.4 Pooling on a Right-Skewed, Continuous Variable
1.4.1 Convolution
1.4.2 Approximating the Density of a Sum of Random Variables
1.4.3 Monte Carlo Expectation Maximization (MCEM) Algorithm
1.5 Quasi-Likelihood
2 Linear Regression on a Pooled Outcome
2.1 Introduction
2.2 Regression Formulation
2.2.1 Equal Aliquot Volumes
2.2.2 Unequal Aliquot Volumes
2.3 Pooling and Selection Methods
2.3.1 "Smart" Selection
2.3.2 "Smart" Pooling
2.3.3 k-means Clustering
2.4 Simulation Study
2.4.1 SLR: Equal Aliquots
2.4.2 SLR: Unequal Aliquots
2.4.3 Multiple Linear Regression
2.5 Data Analysis
2.6 Applying k-means to Logistic Regression
2.7 Discussion
3 Lognormal Regression Models for a Skewed, Pooled Outcome
3.1 Introduction
3.2 A Motivating Example: Cytokines in the CPP
3.3 Regression Model for Individual Subjects
3.4 Naive Model for Pooled Data
3.5 Approximate Model for Pooled Data
3.6 Calculating MLEs
3.7 MLEs via MCEM
3.7.1 E step
3.7.2 Monte Carlo Estimation
3.7.3 M step
3.7.4 Standard Error Estimation
3.7.5 Example: Lognormal Distribution
3.8 Pooling Methods
3.8.1 x-homogeneous Pools
3.8.2 k-means Clustering
3.9 Simulation Study
3.9.1 Comparing Analytical Strategies
3.9.2 Comparing Pooling Strategies
3.9.3 Convolution Method
3.9.4 A Cautionary Tale
3.10 Data Analysis
3.11 Discussion
4 Comparing Parametric and Semi-Parametric Models for a Skewed, Pooled Outcome
4.1 Introduction
4.2 Parametric Regression Models for Skewed Outcomes
4.2.1 Lognormal Model
4.2.2 Gamma1 Model
4.2.3 Gamma2 Model
4.3 Semi-parametric Regression Models for Skewed Data
4.3.1 Approximate Model Revisited
4.3.2 Quasi-Likelihood Models
4.4 Model Selection
4.5 Simulation Study
4.5.1 Lognormal
4.5.2 Gamma1
4.5.3 Gamma2
4.5.4 Precision
4.5.5 Naive QL Models
4.6 Data Analysis
4.7 Discussion
5 Summary and Future Research
A R and SAS Code Examples
A.1 k-means Clustering
A.2 gamma2 Model
A.3 QL Model under heterogeneous pools

About this Dissertation

Rights statement

Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.

School	Laney Graduate School
Department	Biostatistics
Degree	PhD
Submission	Dissertation
Language	English
Research Field	Biology, Biostatistics; Statistics; Health Sciences, Epidemiology
Keyword	MCEM; k-means Clustering; Regression; Biomarkers; Pooled Specimens; Skewed Data
Committee Chair / Thesis Advisor	Lyles, Robert, Emory University
Committee Members	Long, Qi, Emory University; Schisterman, Enrique, NICHD; Manatunga, Amita, Emory University

Primary PDF

Title	Regression Models for a Continuous Outcome Subject to Pooling
Date Uploaded	2018-08-28 11:07:52 -0400
