Privacy Preserving Medical Data Publishing Open Access
Gardner, James (2012)
Abstract
There is an increasing need for sharing of medical information for public health research. Data custodians and honest brokers have an ethical and legal requirement to protect the privacy of individuals when publishing medical datasets. This dissertation presents an end-to-end Health Information DE-identification (HIDE) system and framework that promotes and enables privacy preserving medical data publishing of textual, structured, and aggregated statistics gleaned from electronic health records (EHRs). This work reviews existing de-identification systems, personal health information (PHI) detection, record anonymization, and differential privacy of multi-dimensional data. HIDE integrates several state-of-the-art algorithms into a unified system for privacy preserving medical data publishing. The system has been applied to a variety of real-world and academic medical datasets. The main contributions of HIDE include: 1) a conceptual framework and software system for anonymizing heterogeneous health data, 2) an adaptation and evaluation of information extraction techniques and modification of sampling techniques for protected health information (PHI) and sensitive information extraction in health data, and 3) applications and extension of privacy techniques to provide privacy preserving publishing options to medical data custodians, including de-identified record release with weak privacy and multidimensional statistical data release with strong privacy.
Table of Contents
1 Introduction
1.1 Privacy
1.2 Health Information DE-identification
1.2.1 Overview
1.2.2 Contributions
1.3 Organization
2 Background and Related Work
2.1 Existing medical record de-identification systems
2.2 Privacy preserving data publishing
2.2.1 De-identification options specified by HIPAA
2.2.2 General anonymization principles
2.3 Formal principles
2.3.1 Weak privacy
2.3.2 Strong privacy
2.4 Discussion
3 HIDE Framework
3.1 Overview
3.2 Health information extraction
3.3 Data linking
3.4 Privacy models
3.4.1 Weak privacy through structured anonymization
3.4.2 Strong privacy through differentially private data cubes
3.5 Heterogeneous Medical Data
3.5.1 Formats
3.5.2 Datasets used in this dissertation
3.6 Software
3.7 Discussion
4 Health Information Extraction
4.1 Modeling PHI detection
4.2 Conditional Random Field background
4.2.1 Features and Sequence Labeling
4.2.2 From Generative to Discriminative
4.2.3 Definition
4.2.4 Parameter Learning
4.3 Metrics
4.4 Feature sets
4.4.1 Regular expression features
4.4.2 Affix features
4.4.3 Dictionary features
4.4.4 Context features
4.4.5 Experiments
4.5 Sampling
4.5.1 Cost-proportionate sampling
4.5.2 Random O-sampling
4.5.3 Window sampling
4.5.4 Experiments
4.6 Discussion
5 Privacy-Preserving Publishing
5.1 Weak privacy
5.1.1 Mondrian Algorithm
5.1.2 Count Queries on Extracted PHI
5.2 Strong privacy
5.2.1 Differentially private datacubes
5.2.2 DPCube algorithm
5.2.3 Temporal queries
5.3 Evaluations
5.3.1 Distribution accuracy
5.3.2 Information gain threshold
5.3.3 Trend accuracy
5.3.4 Temporal queries
5.3.5 Applying DPCube to temporal data
5.3.6 Applying tree-based approach to temporal data
5.4 Discussion
6 Conclusion and Future Work
6.1 Integration
6.2 Extension of prefix tree approach
6.3 Combining unstructured data
6.4 Larger-scale statistical analysis
6.5 Clinical use cases
6.6 Conclusion
Primary PDF
Title | Date Uploaded |
---|---|
Privacy Preserving Medical Data Publishing | 2018-08-28 16:30:11 -0400 |