Machine Learning Methods for Biomedical Keyphrase Extraction Open Access

Gero, Zelalem (Fall 2021)

Permanent URL: https://etd.library.emory.edu/concern/etds/pz50gx578?locale=en

Published

Abstract

Due to the increased generation and digitization of text documents on the Internet and in digital libraries, automated methods that can improve search, discovery, and mining of the vast body of literature are essential. Efficient automated methods that extract keywords to retrieve the salient concepts of a document have been shown to be of paramount importance in text analysis, document summarization, topic detection, and recommendation systems, among others. Various machine learning approaches have been proposed to solve the problem of keyword extraction, but the results still lag behind those of other tasks such as document classification. The task of keyword extraction in the biomedical domain is even more daunting, since the literature is highly domain-specific and general methods do not translate well. To deal with these problems we propose 1) an unsupervised extraction method based on phrase embeddings and a modified PageRank algorithm, which converges faster and performs better than related baseline methods; 2) a deep learning method that pays more attention to words that are central to the document’s semantics; 3) a semi-supervised deep learning approach that harnesses the widely available unannotated biomedical data and improves keyword extraction based on uncertainty estimation; and 4) an encoder-decoder-based extraction method for Medical Subject Headings (MeSH) indexing.

Table of Contents

1 Introduction

1.1 What Constitutes a Keyphrase

1.2 Contributions

1.3 Outline

2 Unsupervised Keyphrase Extraction

2.1 Introduction

2.1.1 Related Work

2.1.2 Graph-based Methods

2.2 Proposed Model: NamedKeys

2.2.1 Candidate Keyphrase Generation

2.2.2 Phrase Embedding: PMCVec

2.2.3 Phrase Quality

2.2.4 Candidate Clustering and Ranking

2.3 Experiments

2.3.1 Dataset

2.3.2 Baseline Methods

2.3.3 Conclusion

3 Supervised Keyphrase Extraction

3.1 Introduction

3.2 Related Work

3.3 Methodology

3.3.1 Word Embedding Layer

3.3.2 BiLSTM Layer

3.3.3 Centrality Weighting Layer

3.3.4 Conditional Random Fields (CRF)

3.4 Experiments

3.4.1 Datasets

3.4.2 Experimental Settings

3.4.3 Results

3.4.4 Conclusion

4 Semi-Supervised Keyphrase Extraction

4.1 Introduction

4.2 Related Work

4.3 Methodology

4.3.1 BiLSTM-CRF Architecture

4.3.2 Self-training and Uncertainty Estimation

4.4 Experiments

4.4.1 Datasets

4.4.2 Experimental Settings

4.4.3 Evaluation Results

5 MeSH Indexing: Keyphrase Extraction from Controlled Vocabulary

5.1 Introduction

5.2 Related Work

5.3 Proposed Model: Encoder-Decoder with RL for MeSH Indexing

5.3.1 Encoder

5.3.2 Decoder

5.3.3 Reinforcement Learning for seq2seq Training

5.3.4 Conditional Random Fields (CRF)

5.4 Experimental Results

5.4.1 Dataset

5.4.2 Evaluation and Results

5.5 Conclusion

6 Conclusion and Future Work

7 Bibliography

About this Dissertation

Rights statement

Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.

School	Laney Graduate School
Department	Computer Science and Informatics
Degree	Ph.D.
Submission	Dissertation
Language	English
Research Field	Computer Science; Biology, Bioinformatics
Keyword	Biomedical Informatics; Keyphrase Extraction; Machine Learning
Committee Chair / Thesis Advisor	Joyce C. Ho, Emory University
Committee Members	Tristan Naumann, Microsoft Research; Imon Banerjee, Emory University; Abeed Sarker, Emory University

Primary PDF

Title	Date Uploaded
Machine Learning Methods for Biomedical Keyphrase Extraction	2021-10-19 11:49:19 -0400
