Abstract
We live in the age of big data. With an increasing number of
people, devices, and sensors connected with digital networks,
individual data now can be largely collected and analyzed by data
mining applications for social good as well as for commercial
interests. However, the data generated by individual users exhibit
unique behavioral patterns and sensitive information, and therefore
must be transformed prior to the release for analysis. The AOL
search log release in 2006 is an example of privacy catastrophe,
where the searches of an innocent citizen were quickly
re-identified by a newspaper journalist. In this dissertation, I
present a novel framework to release continuous aggregation of
private data for an important class of real-time data mining tasks,
such as disease outbreak detection and web mining, to name a few.
The key innovation is that the proposed framework captures the
underlying dynamics of the continual aggregate statistics with time
series state-space models, and simultaneously adopts filtering
techniques to correct the observed, noisy data. It can be shown
that the new framework provides a rigorous, provable privacy
guarantee to individual data contributors without compromising the
output analysis results. Extensive empirical studies confirm that
it will enable privacy-preserving data analytical tasks in a broad
range of application domains.
Table of Contents
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . 1
1.2 Research Contributions . . . . . . . . . . . . . . . . . . . .
. . 4
1.2.1 Real-Time Aggregate Monitoring with Differential Privacy
(Chapter 3) . . . . . . . . . . . . . . . . . . . . . 4
1.2.2 Estimating Individual Privacy Risk with Domain Statistics
(Chapter 4) . . . . . . . . . . . . . . . . . . . . . . 5
1.2.3 Modeling Spatio-Temporal Correlation of Multi-Variate
Aggregate Time Series (Chapter 5) . . . . . . . . . . . 6
2 Related Works 8
2.1 Differential Privacy . . . . . . . . . . . . . . . . . . . . .
. . . 8
2.2 Time Series Analysis and Privacy . . . . . . . . . . . . . . .
. 9
2.3 Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . 10
2.4 Anomaly Detection . . . . . . . . . . . . . . . . . . . . . . .
. 11
3 Real-Time Aggregate Monitoring with Differential Privacy 13
3.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . .
. . 13
3.1.1 Problem Statement . . . . . . . . . . . . . . . . . . . .
13
3.1.2 Differential Privacy . . . . . . . . . . . . . . . . . . . .
14
3.1.3 Existing Solutions . . . . . . . . . . . . . . . . . . . . .
16
3.2 FAST . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . 17
3.2.1 Filtering . . . . . . . . . . . . . . . . . . . . . . . . . .
20
3.2.2 Sampling . . . . . . . . . . . . . . . . . . . . . . . . .
28
3.3 Experimental Results . . . . . . . . . . . . . . . . . . . . .
. . 33
3.3.1 Effects of Filtering . . . . . . . . . . . . . . . . . . . .
35
3.3.2 Effects of Sampling . . . . . . . . . . . . . . . . . . . .
38
3.3.3 Utility vs. Privacy . . . . . . . . . . . . . . . . . . . .
41
3.3.4 Detection and Correlation . . . . . . . . . . . . . . . .
43
4 Analyzing Individual Privacy Risk for Releasing Long Aggregate
Series 45
4.1 Differentially Private Anomaly Detection . . . . . . . . . . .
. 45
4.1.1 Privacy Risk Definition . . . . . . . . . . . . . . . . . .
45
4.1.2 Epidemic Outbreak Detection . . . . . . . . . . . . . .
48
4.1.3 Sensitivity Analysis. . . . . . . . . . . . . . . . . . . .
49
4.1.4 Laplace Perturbation . . . . . . . . . . . . . . . . . . .
50
4.1.5 Filtering . . . . . . . . . . . . . . . . . . . . . . . . . .
51
4.2 Experimental Results . . . . . . . . . . . . . . . . . . . . .
. . 52
4.2.1 Accuracy of Released Data . . . . . . . . . . . . . . . .
54
4.2.2 Outbreak Detection Evaluation . . . . . . . . . . . . .
56
5 Modeling Spatio-Temporal Correlation of Multi-Variate Aggregate
Time Series 59
5.1 Differentially Private Web Monitoring . . . . . . . . . . . . .
. 59
5.1.1 Problem Statement . . . . . . . . . . . . . . . . . . . .
59
5.2 Proposed Solution . . . . . . . . . . . . . . . . . . . . . . .
. . 62
5.2.1 Univariate Time Series Approach . . . . . . . . . . . .
63
5.2.2 Multivariate Time Series Approach . . . . . . . . . . .
65
5.3 Experimental Results . . . . . . . . . . . . . . . . . . . . .
. . 68
5.3.1 Simulating Dynamic Browsing Sessions . . . . . . . . .
69
5.3.2 Learning Models . . . . . . . . . . . . . . . . . . . . .
71
5.3.3 Utility Evaluation . . . . . . . . . . . . . . . . . . . . .
72
5.3.4 Runtime . . . . . . . . . . . . . . . . . . . . . . . . . .
78
5.4 Differentially Private Traffic Monitoring . . . . . . . . . . .
. 79
5.4.1 Problem Statement . . . . . . . . . . . . . . . . . . . .
79
5.4.2 Baseline Solution - LPA Algorithm . . . . . . . . . . .
80
5.5 Proposed Solutions . . . . . . . . . . . . . . . . . . . . . .
. . 81
5.5.1 Temporal Estimation . . . . . . . . . . . . . . . . . . .
82
5.5.2 Spatial Estimation . . . . . . . . . . . . . . . . . . . .
83
5.6 Experimental Results . . . . . . . . . . . . . . . . . . . . .
. . 86
5.6.1 Parameter Impacts . . . . . . . . . . . . . . . . . . . .
89
5.6.2 Utility Performance . . . . . . . . . . . . . . . . . . . .
91
5.6.3 Runtime Performance . . . . . . . . . . . . . . . . . . .
93
6 Conclusion and Future Work 95
6.1 Summary of Dissertation . . . . . . . . . . . . . . . . . . . .
. 96
6.2 Recommendations for Future Work . . . . . . . . . . . . . . .
99
Appendix 101
Bibliography 106
About this Dissertation
Rights statement
- Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
School |
|
Department |
|
Degree |
|
Submission |
|
Language |
|
Research Field |
|
关键词 |
|
Committee Chair / Thesis Advisor |
|
Committee Members |
|