Designing an integrated analytic database for lymphoma patient research using an open source toolset: Design and evaluation of elements for the creation of the Georgia Patient Analytic Lymphoma Registry (GA-PAL)

Open Access

Harris, Wayne A C (2014)

Permanent URL: https://etd.library.emory.edu/concern/etds/qf85nb862?locale=en
Published

Abstract

Informatics increasingly shapes the management, analysis, and reporting of health data. As the field has matured, better tools have evolved for these purposes and have found wider application in healthcare research. Significant challenges remain, however, particularly in the collection, integration, and analysis of health data. Unstructured data stored in large silos at healthcare institutions, together with a lack of standardization, contribute to these challenges. So does a rapidly growing stream of data that is becoming harder to manage and interpret. Together, these challenges create a level of complexity that is difficult to overcome.

In this project, we describe methods for using existing data integration tools to construct a lymphoma patient database, and we construct an ontology to link ICD-9 coded electronic health record data with ICD-O-3 coded cancer registry data. The Georgia Patient Analytic Lymphoma Registry (GA-PAL) database is based on Eureka Clinical Analytics, an open source, semantically driven analytic informatics platform under development at Emory University. The platform relies on a suite of applications to provide the desired functionality. Protégé (http://protege.stanford.edu, Stanford University) is the ontology management component. Data extraction and transformation are performed by PROTEMPA, a temporal data abstraction technology. All of the data are then imported into I2B2, a database platform in which data can be queried using ontology concepts as well as derived or user-defined variables. We created a database of 12,491 patients with a diagnosis of lymphoma defined by ICD-9 codes from 1992 to 2012. A simple query of this data set for patients receiving the RCHOP chemotherapy regimen produced a subset of 3,082 patients. The ICD-9 based patient count conflicted with data we received from the hospital cancer registry, which indicated about 4,500 confirmed cases of lymphoma diagnosed during the same period.

Although challenges remain before full functionality is achieved, this open source solution to a prevailing problem shows great promise. Our work draws on previous work to develop the LEAD database architecture based on the caBIG platform (Huang et al. 2009).
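The linkage described in the abstract, ICD-9 coded record data on one side and ICD-O-3 coded registry data on the other, can be illustrated with a minimal Python sketch. It is not the GA-PAL mapping ontology or any part of Eureka Clinical Analytics: the concept names, the two-entry code tables, and the functions `concept_for_icd9`, `concepts_by_patient`, and `reconcile` are all assumptions made for illustration. The idea is simply that both code systems are mapped to a shared concept, after which patients can be split into those confirmed by the registry and those found in only one source.

```python
from collections import defaultdict

# ASSUMPTION: illustrative concept names and code lists only.
# This is not the mapping ontology built in the thesis.
ICD9_PREFIX_TO_CONCEPT = {
    "201": "hodgkin_lymphoma",        # ICD-9-CM 201.x (Hodgkin's disease)
    "202.0": "follicular_lymphoma",   # ICD-9-CM 202.0x (nodular lymphoma)
}
ICDO3_TO_CONCEPT = {
    "9650/3": "hodgkin_lymphoma",     # ICD-O-3 morphology: Hodgkin lymphoma, NOS
    "9690/3": "follicular_lymphoma",  # ICD-O-3 morphology: follicular lymphoma, NOS
}


def concept_for_icd9(code):
    """Map an ICD-9 code to a shared concept by longest matching prefix."""
    for prefix in sorted(ICD9_PREFIX_TO_CONCEPT, key=len, reverse=True):
        if code.startswith(prefix):
            return ICD9_PREFIX_TO_CONCEPT[prefix]
    return None


def concepts_by_patient(rows, to_concept):
    """Group mapped concepts by patient ID, skipping codes with no mapping."""
    grouped = defaultdict(set)
    for patient_id, code in rows:
        concept = to_concept(code)
        if concept is not None:
            grouped[patient_id].add(concept)
    return grouped


def reconcile(ehr_rows, registry_rows):
    """Split patients into registry-confirmed, EHR-only, and registry-only sets."""
    ehr = concepts_by_patient(ehr_rows, concept_for_icd9)
    registry = concepts_by_patient(registry_rows, ICDO3_TO_CONCEPT.get)
    # A patient is confirmed when both sources agree on at least one concept.
    confirmed = {pid for pid in ehr if ehr[pid] & registry.get(pid, set())}
    return {
        "confirmed": confirmed,
        "ehr_only": set(ehr) - confirmed,
        "registry_only": set(registry) - confirmed,
    }
```

Calling `reconcile` with lists of `(patient_id, code)` pairs returns three sets of patient IDs. A large `ehr_only` set would be one way to surface the kind of gap reported above between the ICD-9 defined cohort and the registry-confirmed cases.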

Table of Contents

1. Introduction

A. Background

B. Problem Statement

C. Purpose Statement

D. Significance Statement

E. Definition of Terms

2. Review of previous integration efforts (literature review)

A. Large Linked Databases

B. Cancer Bioinformatics Grid (CaBIG)

C. The Lymphoma Enterprise Architecture Database

D. Summary of the current problem

3. Methodology

A. Introduction - Georgia Patient Analytic Lymphoma Registry (GA-PAL)

B. Requirements

C. Proposed Solution

  • Platform Improvements

  • Functional Improvements

  • Process Improvements

D. Building the Mapping Ontology

E. Analysis of Variables for Data Integration

4. Results

5. Discussion

About this Master's Thesis

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
Language
  • English