Big Data goes Personal: Privacy and Social Challenges Open Access

Bonomi, Luca (2015)

Permanent URL: https://etd.library.emory.edu/concern/etds/6q182k87k?locale=en
Published

Abstract

The Big Data phenomenon is posing new challenges in our modern society. In addition to requiring information systems to effectively manage high-dimensional and complex data, the privacy and social implications associated with the data collection, data analytics, and service requirements create new important research problems. First, the high volume of personal data generated by users' devices (e.g. credit card transactions, GPS trajectories from mobile devices and medical data) can be used, much like a fingerprint, to identify the person who created it with the risk of disclosing sensitive information such as: political inclination, financial status and medical condition. Second, popular social networks (e.g. Facebook, Foursquare, Yelp) not only enable users to share locations and preference but also create opportunities for them to establish complex interactions (e.g. forming communities, planning trip). This creates the needs for location based services to provide services to groups of users rather than individuals. In this dissertation, we present effective solutions for both these privacy and social challenges. In the privacy domain, we propose new privacy preserving techniques to provide individual users with formal guarantee of privacy while at the same time preserve meaningful information of the data released. We demonstrate the effectiveness of our solutions in different domains such as: sequential pattern mining, record linkage, and computation of statistics over data streams. In the social domain, we propose a new type of group query aiming to find a route that all users can traverse while maximizing the group preference for the locations jointly visited. The ability of solving such query can greatly benefit many existing and emerging tools that allow users to share route information (e.g. Uber, Waze) and plan group outings or trips (e.g. QuickCliqs). Extensive empirical studies demonstrate the effectiveness of our solutions and provide us with important insights for future research directions.

Table of Contents

1 Introduction. 1

1.1 Motivation. 1

1.1.1 Privacy Needs. 3

1.1.2 Social Needs. 4

1.2 Research Contributions. 5

1.2.1 Sequential pattern mining (Chapter 3). 6

1.2.2 Record Linkage (Chapter 4). 7

1.2.3 Data anlytics over streams (Chapter 5). 8

1.2.4 Group Route Query (Chapter 6). 10

1.3 Organization. 11

2 Related Works. 14

2.1 Differential Privacy. 14

2.1.1 Achieving Differential Privacy. 15

2.1.2 Event-level privacy. 16

2.2 Privacy Preserving Frequent Pattern Mining. 17

2.3 Privacy Preserving Record Linkage. 18

2.4 Stream. 21

2.4.1 Longest Increasing Subsequence. 21

2.5 Spatial Queries in Location-based Services. 23

3 Frequent Pattern Mining of Sequential Data. 28

3.1 Problem Definition. 28

3.1.1 Problem Challenges. 29

3.2 A Baseline Solution. 30

3.2.1 Prefix Tree Algorithm. 31

3.3 Two-Phase Algorithm. 36

3.3.1 Model-based Prefix Tree Miner. 37

3.3.2 Error Analysis for Prefix Tree. 42

3.3.3 Transformation & Refinement. 44

3.4 Analysis. 49

3.4.1 Complexity Analysis. 49

3.4.2 Privacy Analysis. 50

3.5 Experiments. 52

3.5.1 Impact of the parameters on the utility. 53

3.5.2 Comparison for mining frequent patterns. 54

3.6 Conclusion. 59

4 Privacy Preserving Record Linkage. 63

4.1 Preliminaries. 63

4.1.1 Basic Definitions. 63

4.2 Overview of Proposed Solution. 64

4.3 Mining Phase. 68

4.3.1 Base Generation. 69

4.4 Embedding Phase. 70

4.4.1 Embedding. 70

4.4.2 Impact of the Grams. 71

4.5 Matching Phase. 75

4.5.1 Global Threshold. 76

4.5.2 Personalized Threshold. 78

4.6 Security Analysis. 80

4.6.1 Adversary Model. 81

4.6.2 Security against one adversary. 81

4.6.3 Security against collusion. 82

4.7 Experiments. 84

4.7.1 Embedding Performance. 84

4.7.2 Mining Performance. 88

4.7.3 Linking Performance. 89

4.7.4 Security. 93

4.8 Conclusion. 95

5 Analytics over Data Stream. 99

5.1 Differentially Private Computation of the LIS - A Baseline Approach. 99

5.2 Decomposition Framework. 100

5.2.1 Binary Decomposition. 102

5.3 Hierarchy Mechanism. 108

5.4 Summary of Results. 114

5.4.1 Extensions. 115

5.5 Conclusions. 116

6 Group Trip Planning Query. 118

6.1 Problem Definition. 118

6.2 Algorithms. 122

6.2.1 Preprocessing Step. 123

6.2.2 Dynamic Programming. 126

6.2.3 Approximation Algorithm. 133

6.2.4 Greedy Algorithm. 135

6.3 Extensions of our Solutions. 137

6.4 Experiments. 139

6.4.1 Settings. 140

6.4.2 Results on Real Dataset. 142

6.4.3 Result on Synthetic Dataset. 151

6.4.4 Implementation of Extensions. 153

6.5 Related Work. 155

6.6 Conclusion. 157

7 Conclusion and Future Work. 164

7.1 Summary. 165

7.1.1 Privacy Contributions. 165

7.1.2 Social Contributions. 166

7.2 Future Work. 167

Appendix. 169

8.1 Statistical Tools for Multiple Laplace Random Variables. 169

Bibliography. 170

About this Dissertation

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
School
Department
Degree
Submission
Language
  • English
Research Field
Keyword
Committee Chair / Thesis Advisor
Committee Members
Last modified

Primary PDF

Supplemental Files