Advanced Sparse Concept Detection and Recognition in Biomedical Texts Via Few-Shot Learning Algorithms

Restricted; Files Only

Ge, Yao (Fall 2024)

Permanent URL: https://etd.library.emory.edu/concern/etds/vx021g70r?locale=en

Published

Abstract

Many natural language processing (NLP) problems involving biomedical texts have limited annotated data available. Traditional supervised machine learning and deep learning algorithms require large volumes of annotated data and underperform on small annotated datasets. Few-shot learning (FSL) methods aim to enable effective learning without large annotated datasets, but the performance of FSL-based NLP methods remains suboptimal, particularly for biomedical texts, which limits their application in real-world settings. The overarching objective of this thesis is to rigorously validate the current state of the art in FSL methods for named entity recognition (NER) from biomedical texts and to propose novel FSL approaches that improve upon it.

Given the emerging interest in, and early-stage development of, FSL approaches in biomedical NLP, we conducted a systematic review and benchmarking of existing methods, which revealed their underperformance on most biomedical datasets. To address data sparsity in FSL, we proposed a novel method combining data augmentation with a nearest neighbor classifier (DANN). We extended this method with a synthetic data generation module (HILGEN) that leverages hierarchical information from the Unified Medical Language System (UMLS) and information generated by large language models (LLMs). Finally, building on recent progress, we further enhanced NER performance by leveraging LLMs with prompt engineering and a dynamic prompting strategy involving retrieval-augmented generation (RAG).

These methods improved NER performance across multiple datasets in FSL settings, including MIMIC III, NCBI disease, BC5CDR, and a dataset (Reddit-Impacts) created specifically as part of this research. For example, on MIMIC III in a 5-shot setting, the F1 score rose from near zero with BERT to 19.69 with our DANN model, 58.68 with HILGEN-generated synthetic data, and 76.24 with RAG-based dynamic prompting. Similar gains were observed across the other datasets. Our research demonstrates that combining enriched data representation, domain knowledge, synthetic data, and context-aware prompting effectively addresses data sparsity and enhances biomedical NER in FSL settings. These advancements mark significant progress toward operationalizing FSL-based NER systems for biomedical applications.
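As a rough illustration of the dynamic prompting idea named in the abstract, the sketch below retrieves the annotated examples most similar to an input sentence and places them in a few-shot NER prompt. It is not the dissertation's implementation: the abstract does not specify the retriever, the prompt wording, or the LLM, so the bag-of-words cosine similarity, the instruction text, the function names (`retrieve`, `build_prompt`), and the example sentences are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts; an assumed stand-in for a real sentence encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, pool, k=2):
    """Return the k annotated examples most similar to the query sentence."""
    q = bag_of_words(query)
    ranked = sorted(
        pool,
        key=lambda ex: cosine(q, bag_of_words(ex["text"])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, pool, k=2):
    """Assemble a few-shot NER prompt from the retrieved examples."""
    # Hypothetical instruction text, not taken from the dissertation.
    lines = ["Extract the disease mentions from each sentence."]
    for ex in retrieve(query, pool, k):
        lines.append(f"Sentence: {ex['text']}")
        lines.append(f"Entities: {', '.join(ex['entities'])}")
    lines.append(f"Sentence: {query}")
    lines.append("Entities:")
    return "\n".join(lines)


# Hypothetical annotated pool (invented sentences, not from any dataset).
POOL = [
    {"text": "The patient was diagnosed with type 2 diabetes.",
     "entities": ["type 2 diabetes"]},
    {"text": "Mutations in BRCA1 increase the risk of breast cancer.",
     "entities": ["breast cancer"]},
    {"text": "He reported chronic migraine for several years.",
     "entities": ["chronic migraine"]},
]

prompt = build_prompt("She has a family history of breast cancer.", POOL, k=1)
print(prompt)
```

The prompt changes with each input because the in-context examples are selected per query, whereas a static prompt would reuse the same examples for every sentence. The resulting string would then be sent to an LLM, a step omitted here.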

Table of Contents

1 Introduction

1.1 Overview

1.2 Few-shot Learning for Biomedical Named Entity Recognition

1.2.1 Few-shot Learning

1.2.2 FSL for Biomedical NER

1.2.3 Early Approaches to FSL for NER

1.2.4 LLMs on FSL for Biomedical NER

1.3 Research Questions

1.4 Thesis Outline

2 Literature Review

2.1 Search strategy

2.2 Study selection and exclusion criteria

2.3 Data abstraction and synthesis

2.4 Results

2.4.1 Data collection results

2.4.2 Dimensions of characterization

2.4.3 Data characteristics

2.4.4 A summary of methodologies

2.4.5 Performance ranges

2.5 Discussion

3 Datasets

3.1 Publicly Available Datasets

3.2 Reddit-Impacts Dataset

3.2.1 Data collection

3.2.2 Annotation

3.2.3 Dataset creation

4 Few-shot Learning for Biomedical NER: Benchmarking Studies

4.1 Traditional and FSL NER Models

4.1.1 Traditional NER Models

4.1.2 Few-shot Learning NER Models

4.2 Data Collection and Preparation

4.3 Experimental Setup

4.4 Results

4.5 Discussion

4.6 Conclusion

5 Data Augmentation with Nearest Neighbor Classifier

5.1 Proposed Approach

5.1.1 Different Distance Methods

5.2 Results and Discussion

5.3 Conclusion

6 HILGEN: Hierarchically-Informed Data Generation for Biomedical NER Using Knowledge Bases and LLMs

6.1 Background

6.1.1 UMLS in Biomedical Natural Language

6.1.2 Synthetic Data Generation

6.2 Proposed Approach

6.2.1 Hierarchical Information and Semantic Network in UMLS

6.2.2 UMLS-Based Data Generation

6.2.3 GPT-Based Data Generation

6.2.4 Fine-Tuning with Transformer-Based and Few-Shot Learning Models

6.2.5 Ensemble Method

6.2.6 Comparison with ZEROGEN

6.3 Datasets and Experiment Setup

6.4 Results

6.4.1 Experimental Results

6.4.2 Comparison with ZEROGEN

6.4.3 Ensemble Approach

6.5 Discussion

6.5.1 Challenges of Zero-Shot Data Generation Approaches

6.5.2 Impact of Ensemble Learning on Model Generalization

6.6 Limitations

6.7 Conclusion

7 From Static to Dynamic: RAG-based Dynamic Prompting for Few-shot Learning

7.1 Background

7.1.1 Retrieval-Augmented Generation

7.2 Proposed Approach

7.2.1 Static Prompt Engineering

7.2.2 Dynamic Prompt Engineering

7.3 Experimental Setup

7.4 Results

7.4.1 Task-specific Static Prompting

7.4.2 Dynamic Prompting with RAG

7.5 Discussion

7.5.1 Analysis of Different LLMs

7.5.2 Performance Improvements via RAG-based Prompting

7.5.3 Variability in the Impact of Shot Size

7.6 Limitations

7.7 Conclusion

8 Conclusion

8.1 Future Work

8.1.1 Advancing Biomedical NER with Technical Innovations

8.1.2 Applications of LLMs in Few-shot BioNER

Appendix A Tables for Literature Review

Appendix B Detailed Task-specific Static Prompts

Appendix C Averaged Performance of the Baseline Dynamic Prompt Model

Appendix D Results of 95% CIs for Each Metric

Bibliography

About this Dissertation

Rights statement

Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.

School	Laney Graduate School
Department	Computer Science and Informatics
Degree	Ph.D.
Submission	Dissertation
Language	English
Research Field	Information Science; Artificial Intelligence; Computer Science
Keyword	Few-shot Learning; Natural Language Processing; Information Extraction; Named Entity Recognition; Biomedical Informatics; Large Language Models
Committee Chair / Thesis Advisor	Sarker, Abeed, Emory University
Committee Members	Ho, Joyce C., Emory University; Al-Garadi, Mohammed Ali, Vanderbilt University; McKay, J Lucas, Emory University

Primary PDF

File download under embargo until 09 January 2026 (uploaded 2024-12-10 02:02:36 -0500)
