Advanced Sparse Concept Detection and Recognition in Biomedical Texts Via Few-Shot Learning Algorithms

Restricted; Files Only

Ge, Yao (Fall 2024)

Permanent URL: https://etd.library.emory.edu/concern/etds/vx021g70r?locale=en
Published

Abstract

Many natural language processing (NLP) problems involving biomedical texts have limited annotated data available. Traditional supervised machine learning and deep learning algorithms require large volumes of annotated data and underperform on small annotated datasets. Few-shot learning (FSL) methods aim to enable effective learning without large annotated datasets, but the performance of FSL-based NLP methods remains suboptimal, particularly for biomedical texts, which limits their application in real-world settings. The overarching objective of this thesis is to rigorously validate the current state of the art in FSL methods for named entity recognition (NER) from biomedical texts and to propose novel FSL approaches that improve upon state-of-the-art methods.

Given the emerging interest in, and early-stage development of, FSL approaches in biomedical NLP, we conducted a systematic review and benchmarking of existing methods, which revealed that they underperform on most biomedical datasets. To address data sparsity problems in FSL, we proposed a novel method combining data augmentation with a nearest neighbor classifier (DANN). We extended this method by adding a synthetic data generation module (HILGEN) that leverages hierarchical information from the Unified Medical Language System (UMLS) and information generated by large language models (LLMs). Finally, building on recent progress, we further enhanced NER performance by leveraging LLMs with prompt engineering and a dynamic prompting strategy involving retrieval-augmented generation (RAG).
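As a rough illustration of the nearest neighbor idea named above, the following minimal sketch labels a token by the label of its closest annotated token in a small support set. It is not the DANN implementation: the abstract does not specify the encoder, the augmentation step, or the distance function, so the three-dimensional vectors, the cosine distance, and the names `cosine_distance`, `nearest_neighbor_label`, and `support_set` are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def nearest_neighbor_label(query, support, k=1):
    """Majority label among the k support vectors closest to the query."""
    ranked = sorted(support, key=lambda item: cosine_distance(query, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical few-shot support set: (token vector, BIO label).
# In practice the vectors would come from a pretrained encoder.
support_set = [
    ([0.9, 0.1, 0.0], "B-Disease"),
    ([0.8, 0.2, 0.1], "B-Disease"),
    ([0.1, 0.9, 0.1], "B-Chemical"),
    ([0.0, 0.1, 0.9], "O"),
    ([0.1, 0.0, 0.8], "O"),
]

# Hypothetical vectors for unlabeled tokens.
query_vectors = {
    "diabetes": [0.85, 0.15, 0.05],
    "aspirin": [0.2, 0.8, 0.0],
    "the": [0.05, 0.05, 0.9],
}

predictions = {
    token: nearest_neighbor_label(vector, support_set)
    for token, vector in query_vectors.items()
}
print(predictions)
```

Because the classifier only compares representations, it needs no gradient updates on the few labeled examples, which is one reason nearest neighbor methods are a common choice in few-shot settings.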

These methods improved NER performance across multiple datasets in FSL settings, including MIMIC III, NCBI disease, BC5CDR, and Reddit-Impacts, a dataset created specifically as part of this research. For example, on MIMIC III in a 5-shot setting, BERT’s near-zero F1 score improved to 19.69 with our DANN model, 58.68 with HILGEN-generated synthetic data, and 76.24 using RAG-based dynamic prompting. Similar gains were observed across other datasets. Our research demonstrates that combining enriched data representation, domain knowledge, synthetic data, and context-aware prompting effectively addresses data sparsity and enhances biomedical NER in FSL settings. These advancements mark significant progress toward operationalizing FSL-based NER systems for biomedical applications.
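The general pattern behind retrieval-based dynamic prompting can be sketched as follows: for each input sentence, retrieve the most similar annotated examples and place them in the prompt as demonstrations. This is a minimal sketch under stated assumptions, not the thesis's method. The bag-of-words cosine retriever, the prompt template, the toy sentences, and the names `retrieve_examples` and `build_prompt` are all hypothetical, and no LLM is called.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts; a stand-in for a real sentence encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve_examples(query, pool, k=2):
    """Return the k annotated examples most similar to the query sentence."""
    q = bag_of_words(query)
    ranked = sorted(
        pool,
        key=lambda ex: cosine_similarity(q, bag_of_words(ex["text"])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, examples):
    """Assemble a few-shot prompt from the retrieved demonstrations."""
    lines = ["Extract the disease mentions from the sentence."]
    for ex in examples:
        lines.append(f"Sentence: {ex['text']}")
        lines.append(f"Entities: {', '.join(ex['entities'])}")
    lines.append(f"Sentence: {query}")
    lines.append("Entities:")
    return "\n".join(lines)


# Invented toy examples, not drawn from any dataset named in the abstract.
example_pool = [
    {"text": "The patient was diagnosed with type 2 diabetes.",
     "entities": ["type 2 diabetes"]},
    {"text": "Aspirin reduced the risk of stroke.",
     "entities": ["stroke"]},
    {"text": "She has a family history of breast cancer.",
     "entities": ["breast cancer"]},
]

query = "The patient has a history of diabetes."
selected = retrieve_examples(query, example_pool, k=2)
prompt = build_prompt(query, selected)
print(prompt)
```

A static prompt would reuse the same demonstrations for every input; here the demonstrations change with each query, which is what makes the prompt dynamic.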

Table of Contents

1   Introduction

1.1   Overview

1.2   Few-shot Learning for Biomedical Named Entity Recognition

1.2.1   Few-shot Learning

1.2.2   FSL for Biomedical NER

1.2.3   Early Approaches to FSL for NER

1.2.4   LLMs on FSL for Biomedical NER

1.3   Research Questions

1.4   Thesis Outline

2   Literature Review

2.1   Search strategy

2.2   Study selection and exclusion criteria

2.3   Data abstraction and synthesis

2.4   Results

2.4.1   Data collection results

2.4.2   Dimensions of characterization

2.4.3   Data characteristics

2.4.4   A summary of methodologies

2.4.5   Performance ranges

2.5   Discussion

3   Datasets

3.1   Publicly Available Datasets

3.2   Reddit-Impacts Dataset

3.2.1   Data collection

3.2.2   Annotation

3.2.3   Dataset creation

4   Few-shot Learning for Biomedical NER: Benchmarking Studies

4.1   Traditional and FSL NER Models

4.1.1   Traditional NER Models

4.1.2   Few-shot Learning NER Models

4.2   Data Collection and Preparation

4.3   Experimental Setup

4.4   Results

4.5   Discussion

4.6   Conclusion

5   Data Augmentation with Nearest Neighbor Classifier

5.1   Proposed Approach

5.1.1   Different Distance Methods

5.2   Results and Discussion

5.3   Conclusion

6   HILGEN: Hierarchically-Informed Data Generation for Biomedical NER Using Knowledge Bases and LLMs

6.1   Background

6.1.1   UMLS in Biomedical Natural Language

6.1.2   Synthetic Data Generation

6.2   Proposed Approach

6.2.1   Hierarchical Information and Semantic Network in UMLS

6.2.2   UMLS-Based Data Generation

6.2.3   GPT-Based Data Generation

6.2.4   Fine-Tuning with Transformer-Based and Few-Shot Learning Models

6.2.5   Ensemble Method

6.2.6   Comparison with ZEROGEN

6.3   Datasets and Experiment Setup

6.4   Results

6.4.1   Experimental Results

6.4.2   Comparison with ZEROGEN

6.4.3   Ensemble Approach

6.5   Discussion

6.5.1   Challenges of Zero-Shot Data Generation Approaches

6.5.2   Impact of Ensemble Learning on Model Generalization

6.6   Limitations

6.7   Conclusion

7   From Static to Dynamic: RAG-based Dynamic Prompting for Few-shot Learning

7.1   Background

7.1.1   Retrieval-Augmented Generation

7.2   Proposed Approach

7.2.1   Static Prompt Engineering

7.2.2   Dynamic Prompt Engineering

7.3   Experimental Setup

7.4   Results

7.4.1   Task-specific Static Prompting

7.4.2   Dynamic Prompting with RAG

7.5   Discussion

7.5.1   Analysis of Different LLMs

7.5.2   Performance Improvements via RAG-based Prompting

7.5.3   Variability in the Impact of Shot Size

7.6   Limitations

7.7   Conclusion

8   Conclusion

8.1   Future Work

8.1.1   Advancing Biomedical NER with Technical Innovations

8.1.2   Applications of LLMs in Few-shot BioNER

Appendix A Tables for Literature Review

Appendix B Detailed Task-specific Static Prompts

Appendix C Averaged Performance of the Baseline Dynamic Prompt Model

Appendix D Results of 95% CIs for Each Metric

Bibliography

About this Dissertation

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
Language
  • English