Machine Learning Methods for Biomedical Keyphrase Extraction

Open Access

Gero, Zelalem (Fall 2021)

Permanent URL:


As more text documents are generated and digitized on the Internet and in digital libraries, automated methods that improve search, discovery, and mining of this vast body of literature have become essential. Efficient methods that extract keywords capturing the salient concepts of a document have been shown to be of paramount importance in text analysis, document summarization, topic detection, and recommendation systems, among others. Various machine learning approaches have been proposed for keyword extraction, but their results still lag behind those for other tasks such as document classification. Keyword extraction in the biomedical domain is even more daunting because the literature is highly domain-specific and general methods do not translate well. To address these problems, we propose: 1) an unsupervised extraction method based on phrase embeddings and a modified PageRank algorithm, which converges faster and performs better than related baseline methods; 2) a deep learning method that pays more attention to words that are central to the document’s semantics; 3) a semi-supervised deep learning approach that harnesses the vast amount of available unannotated biomedical data and improves keyword extraction based on uncertainty estimation; and 4) an encoder-decoder-based extraction method for Medical Subject Heading (MeSH) indexing.
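The graph-based idea behind the first contribution can be illustrated with a minimal sketch. This is not the dissertation's NamedKeys model or its modified PageRank: it is standard weighted PageRank run over a cosine-similarity graph of candidate phrases. The function names, the three-dimensional toy vectors, and the damping value are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_phrases(embeddings, damping=0.85, tol=1e-8, max_iter=100):
    """Rank candidate phrases with standard weighted PageRank.

    `embeddings` maps each phrase to a vector (hypothetical toy input).
    Edge weights are cosine similarities, clipped at zero.
    """
    phrases = list(embeddings)
    n = len(phrases)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                sim = cosine(embeddings[phrases[i]], embeddings[phrases[j]])
                weights[i][j] = max(sim, 0.0)
    out_sum = [sum(row) for row in weights]

    scores = [1.0 / n] * n
    for _ in range(max_iter):
        new_scores = []
        for j in range(n):
            # Each neighbor passes on its score in proportion to edge weight.
            incoming = sum(
                scores[i] * weights[i][j] / out_sum[i]
                for i in range(n)
                if out_sum[i] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        delta = sum(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if delta < tol:
            break
    return sorted(zip(phrases, scores), key=lambda pair: pair[1], reverse=True)

# Toy vectors, invented for illustration only.
toy_embeddings = {
    "gene expression": [0.9, 0.1, 0.0],
    "protein binding": [0.8, 0.2, 0.1],
    "rna sequencing": [0.85, 0.15, 0.05],
    "weather report": [0.0, 0.1, 0.9],
}
ranked = rank_phrases(toy_embeddings)
for phrase, score in ranked:
    print(f"{phrase}: {score:.3f}")
```

In this toy input, the phrase whose vector points away from the others is only weakly connected in the graph and therefore ends up at the bottom of the ranking. A real system would use learned phrase embeddings and a candidate generation step, neither of which is shown here.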

Table of Contents

1      Introduction

1.1  What Constitutes a Keyphrase

1.2  Contributions

1.3  Outline

2      Unsupervised Keyphrase Extraction

2.1  Introduction

2.1.1      Related Work

2.1.2      Graph-based Methods

2.2  Proposed Model: NamedKeys

2.2.1      Candidate Keyphrase Generation

2.2.2      Phrase Embedding: PMCVec

2.2.3      Phrase Quality

2.2.4      Candidate Clustering and Ranking

2.3  Experiments

2.3.1      Dataset

2.3.2      Baseline Methods

2.3.3      Conclusion

3      Supervised Keyphrase Extraction

3.1  Introduction

3.2  Related Work

3.3  Methodology

3.3.1      Word Embedding Layer

3.3.2      BiLSTM Layer

3.3.3      Centrality Weighting Layer

3.3.4      Conditional Random Fields (CRF)

3.4  Experiments

3.4.1      Datasets

3.4.2      Experimental Settings

3.4.3      Results

3.4.4      Conclusion

4      Semi-Supervised Keyphrase Extraction

4.1  Introduction

4.2  Related Work

4.3  Methodology

4.3.1      BiLSTM-CRF Architecture

4.3.2      Self-training and Uncertainty Estimation

4.4  Experiments

4.4.1      Datasets

4.4.2      Experimental Settings

4.4.3      Evaluation Results

5      MeSH Indexing: Keyphrase Extraction from Controlled Vocabulary

5.1  Introduction

5.2  Related Work

5.3  Proposed Model: Encoder-Decoder with RL for MeSH Indexing

5.3.1      Encoder

5.3.2      Decoder

5.3.3      Reinforcement Learning for seq2seq Training 

5.3.4      Conditional Random Fields (CRF)

5.4  Experimental Results

5.4.1      Dataset

5.4.2      Evaluation and Results

5.5  Conclusion

6      Conclusion and Future Work        

7      Bibliography

About this Dissertation

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
  • English
