Consensus clustering of subclone structure for multi-sample sequencing data Open Access

Zheng, Hanyi (Spring 2020)

Background: Tumor heterogeneity refers to the morphological and phenotypic differences among tumor cells and has implications for cancer therapeutics. Accurate assessment of tumor heterogeneity is an essential step in understanding how a tumor evolves, yet identifying tumor subpopulations remains a challenge. In this work, we present a combinatorial algorithm that exploits samples taken from a single patient at multiple time points during tumor development to determine the subclone clusters.

Methods: We first estimated the cancer cell fraction (CCF) and cluster assignments at each time point using a hierarchical Bayesian statistical model fitted by MCMC (PyClone). After imputing the co-clustering matrix, we used non-negative sparse coding to determine consensus clusters across all time points and to avoid trivial clusters. Finally, we adjusted the covariance matrix and used the Bayesian information criterion (BIC) to choose the optimal number of clusters.
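
As a rough illustration of the co-clustering step, the sketch below builds one co-clustering matrix per time point from cluster labels and averages them over the time points at which both mutations are observed. The function names, the use of NaN to mark a missing mutation, and the plain averaging rule are assumptions made for this example; they are not the imputation procedure used in the thesis.

```python
import numpy as np

def coclustering_matrix(labels):
    """Pairwise co-clustering for one time point.

    Entry (i, j) is 1.0 if mutations i and j share a cluster, 0.0 if not,
    and NaN if either mutation is missing (assumed encoding: label = NaN).
    """
    labels = np.asarray(labels, dtype=float)
    same = (labels[:, None] == labels[None, :]).astype(float)
    missing = np.isnan(labels)
    same[missing, :] = np.nan
    same[:, missing] = np.nan
    return same

def consensus_matrix(label_sets):
    """Average co-clustering over the time points where a pair is observed."""
    stack = np.stack([coclustering_matrix(l) for l in label_sets])
    observed = ~np.isnan(stack)
    counts = observed.sum(axis=0)
    totals = np.where(observed, stack, 0.0).sum(axis=0)
    # Pairs never observed together stay NaN and would still need imputation.
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(counts > 0, totals / counts, np.nan)

# Hypothetical labels for four mutations at two time points.
t1 = [0, 0, 1, np.nan]   # the fourth mutation is missing at time point 1
t2 = [0, 1, 1, 1]
consensus = consensus_matrix([t1, t2])
```

Each entry of the resulting matrix can be read as the fraction of informative time points at which two mutations were assigned to the same cluster. A matrix of this kind is the input that a factorization step such as non-negative sparse coding would then decompose.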

Results: We use the weighted CCF as the CCF of each cluster and observe each cluster's trend over time. For PR42, k=5 is the optimal number of clusters and every cluster has a unique trend. For PR44, the overall CCF trend goes down and then up, which suggests that the therapy worked well at first but then lost its effect. For PR240, the CCF of every cluster increases with time for every choice of k, which suggests that the therapy was ineffective for this patient.
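
A weighted cluster CCF can be sketched as a weighted mean of the CCFs of the mutations assigned to each cluster. The abstract does not state which weights are used, so the `weights` argument below is a placeholder (it could stand for read depth, for instance), and the function name and numbers are hypothetical.

```python
import numpy as np

def weighted_cluster_ccf(ccf, labels, weights):
    """Weighted mean CCF per cluster at a single time point.

    ccf, labels and weights are equal-length sequences with one entry per
    mutation. The choice of weights is an assumption of this sketch.
    """
    ccf = np.asarray(ccf, dtype=float)
    labels = np.asarray(labels)
    weights = np.asarray(weights, dtype=float)
    result = {}
    for k in np.unique(labels):
        mask = labels == k
        result[int(k)] = float(np.average(ccf[mask], weights=weights[mask]))
    return result

# Hypothetical values: three mutations in two clusters.
cluster_ccf = weighted_cluster_ccf(
    ccf=[0.2, 0.4, 0.9], labels=[0, 0, 1], weights=[1, 3, 2]
)
```

Repeating this calculation at every time point gives one CCF trajectory per cluster, which is the kind of trend described for PR42, PR44 and PR240.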

Conclusions: This study presents a combinatorial algorithm for determining the subclone clusters of multi-time-point tumor sequencing data. The model works well when the data do not have a high percentage of missing mutations. In addition, sample purity and the trivial clusters generated by PyClone can affect the results. We also found that missing mutations directly affect the co-clustering matrix and the covariance matrix used in the BIC step.

Table of Contents

1 Introduction

2 Methods

2.1 Data Collection and Cleaning

2.2 PyClone

2.3 Co-Clustering matrix imputation

2.4 Non-negative sparse coding

2.5 Deciding optimal number of clusters

2.5.1 Covariance matrix adjustment

2.5.2 Likelihood function

3 Result

3.1 Data summary

3.2 Result of PyClone

3.3 Consensus cluster

3.3.1 Weighted CCF

3.3.2 PR42 result

3.3.3 PR44 result

3.3.4 PR240 result

3.3.5 PR246

4 Discussion

Reference

About this Master's Thesis

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
Subfield / Discipline
  • English