Predicting Disease Comorbidity by Mining Large Text Corpora Open Access

Askew IV, Walter Scott (2009)

Permanent URL: https://etd.library.emory.edu/concern/etds/736664895?locale=en
Published

Abstract

Natural language processing techniques have a variety of applications in public health. This paper presents a method for predicting whether two diseases are frequently comorbid. The system applies previous work on computing word similarity from textual information to the prediction of disease comorbidity. The work rests on the assumption that the rate of comorbidity between two diseases should be reflected in the linguistic similarity of their cooccurrences. Perhaps most excitingly, the paper demonstrates that corpora such as web forums provide useful data for training the system. The ability to mine web-based sources for new medical information has many exciting implications for public health: the web could be used to monitor disease trends and epidemic outbreaks, and to uncover new medical knowledge directly from disease sufferers. The evaluation shows that the system performs above baseline levels in predicting the frequency of comorbidity between diseases.
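The core assumption, that comorbid diseases appear in similar linguistic contexts, can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the abstract does not name the similarity measure, the context window, or the tokenization, so cosine similarity over windowed cooccurrence counts is used here as a stand-in, and the function names `context_vectors` and `cosine` are hypothetical. Disease names are treated as single lowercase tokens.

```python
from collections import Counter
from math import sqrt


def context_vectors(documents, diseases, window=5):
    """Count the words that occur within `window` tokens of each disease term.

    Assumption: whitespace tokenization and single-token disease names.
    """
    vectors = {d: Counter() for d in diseases}
    for doc in documents:
        tokens = doc.lower().split()
        for i, tok in enumerate(tokens):
            if tok in vectors:
                lo = max(0, i - window)
                hi = min(len(tokens), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        vectors[tok][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Toy corpus, invented for this example only.
docs = [
    "depression often brings insomnia and fatigue",
    "anxiety often brings insomnia and worry",
    "asthma needs an inhaler daily",
]
vecs = context_vectors(docs, ["depression", "anxiety", "asthma"])
sim_dep_anx = cosine(vecs["depression"], vecs["anxiety"])
sim_dep_asthma = cosine(vecs["depression"], vecs["asthma"])
```

In this toy corpus the two diseases that share context words receive a higher similarity score than the pair that shares none. In a system like the one described, such scores would serve as input features to a classifier that predicts comorbidity frequency.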

Table of Contents

1 Introduction
1.1 Rationale
2 Methodology
2.1 Indexing
2.2 Counting Cooccurrence
2.3 Similarity Calculation
2.4 Tuning
2.5 Machine Learning
2.6 Classifier Training
3 Evaluation
3.1 Evaluation Metrics
3.2 Corpora
3.3 Ground Truth
4 Results
4.1 Medline Corpus Cross-Validation
4.2 Psych Forums Corpus Cross-Validation
4.3 Classifiers Trained on NCSR Truth Data and Validated on CHIS 2005 Truth Data
5 Discussion
6 Conclusion

About this thesis

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
Language
  • English