Automating Biomedical Abstract Screening using Network Embedding
Open Access

Lee, Eric (Fall 2023)

Permanent URL: https://etd.library.emory.edu/concern/etds/3j3333843?locale=en
Published

Abstract

Systematic review (SR) is an essential process to identify, evaluate, and summarize the findings of all relevant individual studies concerning health-related questions. However, conducting an SR is labor-intensive, as identifying relevant studies is a daunting process that entails multiple researchers screening thousands of articles for relevance. Machine learning models have been proposed to automate SR, especially abstract screening, but they primarily focus on the text and ignore additional features such as citation information. Recent work demonstrated that citation embeddings can outperform the text itself, suggesting that better network representations may expedite SRs. Yet how to utilize the rich information in heterogeneous information networks (HINs) for network embeddings is understudied. In addition, the lack of a unified source that includes the metadata of biomedical literature makes this research more challenging. To address these problems, we present four contributions. First, we propose a model that exploits three representations (documents, topics, and citation networks) to show the effectiveness of the additional features. Second, we introduce the PubMed Graph Benchmark, one of the largest HINs to date, which aggregates rich metadata such as abstracts, authors, citations, and MeSH terms into a unified source. Third, we propose a HIN embedding model that uses a community-based multi-view graph convolutional network to learn better representations from the PubMed Graph Benchmark. Lastly, we propose a hyperbolic representation learning model for graphs with mixed hierarchical (MeSH hierarchies) and non-hierarchical (citations) structures.
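To make the term concrete, the sketch below shows one way a heterogeneous information network of the kind the abstract describes could be held in memory: nodes carry a type (paper, author, MeSH term) and edges carry a relation (cites, written_by, has_mesh). It is an illustration only, not the PubMed Graph Benchmark format or any model from the dissertation; the class name `ToyHIN`, the relation names, and all node identifiers are hypothetical.

```python
from collections import defaultdict


class ToyHIN:
    """A tiny typed graph: every node has a type, every edge has a relation.

    Hypothetical illustration; not the dissertation's data format.
    """

    def __init__(self):
        self.node_type = {}            # node id -> type label
        self.edges = defaultdict(set)  # (source id, relation) -> target ids

    def add_node(self, node_id, node_type):
        self.node_type[node_id] = node_type

    def add_edge(self, source, relation, target):
        # Refuse edges to unknown nodes so the toy graph stays consistent.
        if source not in self.node_type or target not in self.node_type:
            raise KeyError("both endpoints must be added before the edge")
        self.edges[(source, relation)].add(target)

    def neighbors(self, source, relation):
        # Return targets reachable from `source` through one relation type.
        return sorted(self.edges.get((source, relation), set()))


# Made-up example data: two papers, one author, one MeSH term.
hin = ToyHIN()
hin.add_node("paper:A", "paper")
hin.add_node("paper:B", "paper")
hin.add_node("author:X", "author")
hin.add_node("mesh:Neoplasms", "mesh_term")

hin.add_edge("paper:A", "cites", "paper:B")
hin.add_edge("paper:A", "written_by", "author:X")
hin.add_edge("paper:A", "has_mesh", "mesh:Neoplasms")
hin.add_edge("paper:B", "has_mesh", "mesh:Neoplasms")

cited_by_a = hin.neighbors("paper:A", "cites")
mesh_of_b = hin.neighbors("paper:B", "has_mesh")
```

Keeping the relation in the edge key is what distinguishes this from a plain citation graph: an embedding method can treat citation links and MeSH links differently instead of merging them into one untyped adjacency list.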

Table of Contents

1 Introduction

1.1 Motivation

1.2 Research Contributions

1.2.1 MMiDaS-AE

1.2.2 PGB

1.2.3 SR-CoMbEr

1.2.4 Hyperbolic Representation Learning

1.3 Organization

2 Background and Related Work

2.1 Systematic Review

2.2 Network Embedding

2.2.1 Graph Neural Networks

2.2.2 Graph Convolutional Networks

2.2.3 Heterogeneous Information Network Embedding

2.3 Hierarchical Structure Embeddings

2.3.1 Poincaré Embedding

2.3.2 Hyperbolic Entailment Cones

2.4 Systematic Review Datasets

2.5 Evaluation Metrics

3 Multi-modal Missing Data aware Stacked Autoencoder

3.1 Feature Representations

3.1.1 Document Representation

3.1.2 Topic Representation

3.1.3 Citation Network Representation

3.2 Model design

3.2.1 Multi-modal Stacked Autoencoder

3.2.2 Missing Data Imputation in Autoencoder

3.2.3 Multi-label Classification Task

3.3 Experimental Design

3.3.1 Data Preprocessing

3.3.2 Inter-topic Setting

3.3.3 Intra-topic Setting

3.3.4 Fine-tuning Setting

3.3.5 Hyperparameter Tuning

3.4 Empirical Results

3.4.1 Inter-topic Results

3.4.2 Fine-tuning and Intra-topic Results

3.4.3 Ablation Study

4 PubMed Graph Benchmark

4.1 Benchmark Comparison

4.2 Benchmark Construction

4.2.1 Paper Collection

4.2.2 Metadata Extraction from PubMed

4.2.3 Citation Extraction

4.2.4 MeSH Terms Hierarchy

4.3 Data Format

4.3.1 Statistics

4.3.2 Code and Data License Information

4.4 Experimental Design

4.4.1 Data Preprocessing

4.4.2 Baseline Models

4.4.3 Experimental Setup

4.5 Empirical Result

5 Community Multi-view based Enhanced Graph Convolutional Network

5.1 Model Design

5.1.1 Heterogeneous Community Detection

5.1.2 Community Multi-view Learning

5.1.3 Global Consensus

5.2 Experimental Design

5.2.1 Data Preprocessing

5.2.2 Baseline Models

5.2.3 Implementation Details

5.3 Empirical Results

5.4 Ablation Study

6 Hyperbolic Representation Learning for Graphs with Mixed Hierarchical and Non-hierarchical Structures

6.1 HypMix

6.1.1 Root Regularization

6.1.2 Child Regularizations

6.1.3 Non-hierarchical Structure Embedding

6.2 Experimental Design

6.2.1 Data Preprocessing

6.2.2 Statistics of Hierarchical Structures

6.2.3 Baseline Models

6.3 Empirical Results

6.4 Case Study

6.5 Impact of Dimension Size

7 Conclusion and Future Work

7.1 Conclusion

7.2 Future Work

Bibliography

About this Dissertation

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
Language
  • English