Privacy Preserving Medical Data Publishing

Gardner, James (2012)

Permanent URL: https://etd.library.emory.edu/concern/etds/3484zh825?locale=zh
Published

Abstract

There is an increasing need to share medical information for public health
research. Data custodians and honest brokers have an ethical and legal
obligation to protect the privacy of individuals when publishing medical
datasets. This dissertation presents an end-to-end Health Information
DE-identification (HIDE) system and framework that promotes and enables
privacy preserving medical data publishing of textual data, structured data,
and aggregated statistics gleaned from electronic health records (EHRs). This
work reviews existing de-identification systems, protected health information
(PHI) detection, record anonymization, and differential privacy for
multidimensional data. HIDE integrates several state-of-the-art algorithms
into a unified system for privacy preserving medical data publishing. The
system has been applied to a variety of real-world and academic medical
datasets. The main contributions of HIDE include: 1) a conceptual framework
and software system for anonymizing heterogeneous health data, 2) an
adaptation and evaluation of information extraction techniques and a
modification of sampling techniques for PHI and sensitive information
extraction in health data, and 3) the application and extension of privacy
techniques to provide privacy preserving publishing options to medical data
custodians, including de-identified record release with weak privacy and
multidimensional statistical data release with strong privacy.
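
To make the "strong privacy" release of multidimensional statistics concrete, here is a minimal sketch of the standard Laplace mechanism applied to a small data cube of counts. It is illustrative only and is not the HIDE or DPCube implementation: the function names, the toy records, and the value of epsilon are all assumptions introduced for this example.

```python
import random
from collections import Counter
from itertools import product


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is distributed as Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_data_cube(records, domains, epsilon, rng=None):
    """Return a noisy count for every cell of the cube spanned by `domains`."""
    rng = rng or random.Random()
    true_counts = Counter(records)
    # Adding or removing one record changes exactly one cell count by 1,
    # so the L1 sensitivity of the full histogram is 1.
    scale = 1.0 / epsilon
    return {
        cell: true_counts.get(cell, 0) + laplace_noise(scale, rng)
        for cell in product(*domains)
    }


# Hypothetical toy data: (age group, diagnosis) pairs, not from the dissertation.
records = [("30-39", "flu"), ("30-39", "flu"), ("40-49", "asthma")]
domains = [["30-39", "40-49"], ["flu", "asthma"]]
noisy = dp_data_cube(records, domains, epsilon=0.5, rng=random.Random(0))
```

Every cell of the cube receives noise, including empty ones, so the published counts do not reveal which attribute combinations are absent from the data. A smaller epsilon gives stronger privacy at the cost of noisier counts.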

Table of Contents

1 Introduction
1.1 Privacy
1.2 Health Information DE-identification
1.2.1 Overview
1.2.2 Contributions
1.3 Organization
2 Background and Related Work
2.1 Existing medical record de-identification systems
2.2 Privacy preserving data publishing
2.2.1 De-identification options specified by HIPAA
2.2.2 General anonymization principles
2.3 Formal principles
2.3.1 Weak privacy
2.3.2 Strong privacy
2.4 Discussion
3 HIDE Framework
3.1 Overview
3.2 Health information extraction
3.3 Data linking
3.4 Privacy models
3.4.1 Weak privacy through structured anonymization
3.4.2 Strong privacy through differentially private data cubes
3.5 Heterogeneous Medical Data
3.5.1 Formats
3.5.2 Datasets used in this dissertation
3.6 Software
3.7 Discussion
4 Health Information Extraction
4.1 Modeling PHI detection
4.2 Conditional Random Field background
4.2.1 Features and Sequence Labeling
4.2.2 From Generative to Discriminative
4.2.3 Definition
4.2.4 Parameter Learning
4.3 Metrics
4.4 Feature sets
4.4.1 Regular expression features
4.4.2 Affix features
4.4.3 Dictionary features
4.4.4 Context features
4.4.5 Experiments
4.5 Sampling
4.5.1 Cost-proportionate sampling
4.5.2 Random O-sampling
4.5.3 Window sampling
4.5.4 Experiments
4.6 Discussion
5 Privacy-Preserving Publishing
5.1 Weak privacy
5.1.1 Mondrian Algorithm
5.1.2 Count Queries on Extracted PHI
5.2 Strong privacy
5.2.1 Differentially private datacubes
5.2.2 DPCube algorithm
5.2.3 Temporal queries
5.3 Evaluations
5.3.1 Distribution accuracy
5.3.2 Information gain threshold
5.3.3 Trend accuracy
5.3.4 Temporal queries
5.3.5 Applying DPCube to temporal data
5.3.6 Applying tree-based approach to temporal data
5.4 Discussion
6 Conclusion and Future Work
6.1 Integration
6.2 Extension of prefix tree approach
6.3 Combining unstructured data
6.4 Larger-scale statistical analysis
6.5 Clinical use cases
6.6 Conclusion

About this Dissertation

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
School
Department
Degree
Submission
Language
  • English
Research field
Keywords
Committee Chair / Thesis Advisor
Committee Members
Last modified
