GASP: Graph-based Approximate Sequential Pattern Mining Open Access
Dong, Wenqin (Spring 2020)
Abstract
The rapid growth of data capturing sequential ordering information has led sequential pattern mining to become an essential data mining task. In sequential pattern mining, the goal is to discover frequent and useful patterns from sequences in the database. However, conventional algorithms (or exact sequential pattern mining algorithms) that discover all frequent sequential patterns generate a large number of patterns and incur a high computational and memory footprint. Thus, approximate sequential pattern mining techniques have been introduced. Yet, existing approximate methods fail to reflect the true frequent sequential patterns or only target singe-item event sequences. Multi-item event sequences are prominent in real-world applications. As an example, sequential pattern mining of electronic health records where a patient can have multiple interventions or diagnoses at the same visit. To alleviate these issues, we propose GASP, a graph-based approximate sequential pattern mining, that discovers frequent patterns not only on single-item event sequences but also multi-item event sequences. Our approach compresses the sequential information into a concise graph structure which results in a significantly smaller memory footprint. We assess the performance of GASP with both singe-item and multi-item event sequence datasets on computation time, memory usage, and recoverability of frequent patterns. The empirical results suggest that GASP outperforms existing approximate models by achieving better recoverability and outperforms existing exact models on computation time and memory usage.
Table of Contents
1 Introduction 2
2 Background 7
2.1 Definition and Notation.............................. 7
2.2 Exact Sequential Pattern Mining......................... 8
2.3 Approximate Sequential Pattern Mining .................... 10
3 GASP 12
3.1 Subsequence Generation ............................. 13
3.2 Graph Construction................................ 16
3.2.1 Start Probability.............................. 17
3.2.2 Event Transition Probability....................... 18
3.2.3 Edge-Based Ending Probability ..................... 19
3.2.4 Length-Based Ending Probability.................... 20
3.2.5 Graph Construction Algorithm ..................... 21
3.3 Random Walk ................................... 23
3.4 ImplementationDetails.............................. 26
4 Experiment Setup 28
4.1 Dataset....................................... 28
4.2 ExperimentalDesign ............................... 30
4.2.1 Approximate Sequential Pattern Mining . . . . . . . . . . . . . . . . 30
4.2.2 Exact Sequential Pattern Mining .................... 31
4.3 Evaluation Metric ................................. 33
4.3.1 Frequent Patterns Recoverability .................... 33
4.3.2 Computational Time ........................... 34
4.3.3 Memory Usage............................... 35
5 Empirical Results 36
5.1 Evaluation with Approximate Sequential Pattern Mining . . . . . . . . . . . 36
5.1.1 CMS(date)Dataset ............................ 36
5.1.2 FIFADataset................................ 37
5.2 Evaluation with Exact Sequential Pattern Mining . . . . . . . . . . . . . . . 37
5.2.1 CMS(interval) Dataset .......................... 37
5.2.2 CMS(date) Dataset ............................ 39
5.2.3 FIFA Dataset................................ 41
6 Conclusions ................................ 43
Bibliography ................................ 45
About this Honors Thesis
School | |
---|---|
Department | |
Degree | |
Submission | |
Language |
|
Research Field | |
Keyword | |
Committee Chair / Thesis Advisor | |
Committee Members |
Primary PDF
Thumbnail | Title | Date Uploaded | Actions |
---|---|---|---|
GASP: Graph-based Approximate Sequential Pattern Mining () | 2020-04-08 22:12:37 -0400 |
|
Supplemental Files
Thumbnail | Title | Date Uploaded | Actions |
---|