Privacy Preserving Medical Data Publishing (Public)

Gardner, James (2012)

Permanent URL: https://etd.library.emory.edu/concern/etds/3484zh825?locale=zh

Published

Abstract

There is an increasing need for sharing of medical information for public health research. Data custodians and honest brokers have an ethical and legal requirement to protect the privacy of individuals when publishing medical datasets. This dissertation presents an end-to-end Health Information DE-identification (HIDE) system and framework that promotes and enables privacy preserving medical data publishing of textual, structured, and aggregated statistics gleaned from electronic health records (EHRs). This work reviews existing de-identification systems, personal health information (PHI) detection, record anonymization, and differential privacy of multi-dimensional data. HIDE integrates several state-of-the-art algorithms into a unified system for privacy preserving medical data publishing. The system has been applied to a variety of real-world and academic medical datasets. The main contributions of HIDE include: 1) a conceptual framework and software system for anonymizing heterogeneous health data, 2) an adaptation and evaluation of information extraction techniques and modification of sampling techniques for protected health information (PHI) and sensitive information extraction in health data, and 3) applications and extension of privacy techniques to provide privacy preserving publishing options to medical data custodians, including de-identified record release with weak privacy and multidimensional statistical data release with strong privacy.

Table of Contents

1 Introduction
1.1 Privacy
1.2 Health Information DE-identification
1.2.1 Overview
1.2.2 Contributions
1.3 Organization
2 Background and Related Work
2.1 Existing medical record de-identification systems
2.2 Privacy preserving data publishing
2.2.1 De-identification options specified by HIPAA
2.2.2 General anonymization principles
2.3 Formal principles
2.3.1 Weak privacy
2.3.2 Strong privacy
2.4 Discussion
3 HIDE Framework
3.1 Overview
3.2 Health information extraction
3.3 Data linking
3.4 Privacy models
3.4.1 Weak privacy through structured anonymization
3.4.2 Strong privacy through differentially private data cubes
3.5 Heterogeneous Medical Data
3.5.1 Formats
3.5.2 Datasets used in this dissertation
3.6 Software
3.7 Discussion
4 Health Information Extraction
4.1 Modeling PHI detection
4.2 Conditional Random Field background
4.2.1 Features and Sequence Labeling
4.2.2 From Generative to Discriminative
4.2.3 Definition
4.2.4 Parameter Learning
4.3 Metrics
4.4 Feature sets
4.4.1 Regular expression features
4.4.2 Affix features
4.4.3 Dictionary features
4.4.4 Context features
4.4.5 Experiments
4.5 Sampling
4.5.1 Cost-proportionate sampling
4.5.2 Random O-sampling
4.5.3 Window sampling
4.5.4 Experiments
4.6 Discussion
5 Privacy-Preserving Publishing
5.1 Weak privacy
5.1.1 Mondrian Algorithm
5.1.2 Count Queries on Extracted PHI
5.2 Strong privacy
5.2.1 Differentially private datacubes
5.2.2 DPCube algorithm
5.2.3 Temporal queries
5.3 Evaluations
5.3.1 Distribution accuracy
5.3.2 Information gain threshold
5.3.3 Trend accuracy
5.3.4 Temporal queries
5.3.5 Applying DPCube to temporal data
5.3.6 Applying tree-based approach to temporal data
5.4 Discussion
6 Conclusion and Future Work
6.1 Integration
6.2 Extension of prefix tree approach
6.3 Combining unstructured data
6.4 Larger-scale statistical analysis
6.5 Clinical use cases
6.6 Conclusion

About this Dissertation

Rights statement

Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.

School	Laney Graduate School
Department	Math and Computer Science
Degree	PhD
Submission	Dissertation
Language	English
Research Field	Computer Science
Keywords	Information Extraction; De-identification; Natural Language Processing; Privacy; Medical Records
Committee Chair / Thesis Advisor	Xiong, Li, Emory University
Committee Members	Lu, James, Emory University; Post, Andrew, Emory University; Agichtein, Eugene, Emory University

Primary PDF

Title	Date Uploaded
Privacy Preserving Medical Data Publishing	2018-08-28 16:30:11 -0400
