GASP: Graph-based Approximate Sequential Pattern Mining Público

Dong, Wenqin (Spring 2020)

Permanent URL: https://etd.library.emory.edu/concern/etds/rj430579w?locale=pt-BR
Published

Abstract

The rapid growth of data capturing sequential ordering information has led sequential pattern mining to become an essential data mining task. In sequential pattern mining, the goal is to discover frequent and useful patterns from sequences in the database. However, conventional algorithms (or exact sequential pattern mining algorithms) that discover all frequent sequential patterns generate a large number of patterns and incur a high computational and memory footprint. Thus, approximate sequential pattern mining techniques have been introduced. Yet, existing approximate methods fail to reflect the true frequent sequential patterns or only target singe-item event sequences. Multi-item event sequences are prominent in real-world applications. As an example, sequential pattern mining of electronic health records where a patient can have multiple interventions or diagnoses at the same visit. To alleviate these issues, we propose GASP, a graph-based approximate sequential pattern mining, that discovers frequent patterns not only on single-item event sequences but also multi-item event sequences. Our approach compresses the sequential information into a concise graph structure which results in a significantly smaller memory footprint. We assess the performance of GASP with both singe-item and multi-item event sequence datasets on computation time, memory usage, and recoverability of frequent patterns. The empirical results suggest that GASP outperforms existing approximate models by achieving better recoverability and outperforms existing exact models on computation time and memory usage.

Table of Contents

1 Introduction 2

2 Background 7

2.1 Definition and Notation.............................. 7

2.2 Exact Sequential Pattern Mining......................... 8

2.3 Approximate Sequential Pattern Mining .................... 10

3 GASP 12

3.1 Subsequence Generation ............................. 13

3.2 Graph Construction................................ 16

3.2.1 Start Probability.............................. 17

3.2.2 Event Transition Probability....................... 18

3.2.3 Edge-Based Ending Probability ..................... 19

3.2.4 Length-Based Ending Probability.................... 20

3.2.5 Graph Construction Algorithm ..................... 21

3.3 Random Walk ................................... 23

3.4 ImplementationDetails.............................. 26 

4 Experiment Setup 28

4.1 Dataset....................................... 28

4.2 ExperimentalDesign ............................... 30

4.2.1 Approximate Sequential Pattern Mining . . . . . . . . . . . . . . . . 30

4.2.2 Exact Sequential Pattern Mining .................... 31

4.3 Evaluation Metric ................................. 33

4.3.1 Frequent Patterns Recoverability .................... 33

4.3.2 Computational Time ........................... 34

4.3.3 Memory Usage............................... 35

5 Empirical Results 36

5.1 Evaluation with Approximate Sequential Pattern Mining . . . . . . . . . . . 36

5.1.1 CMS(date)Dataset ............................ 36

5.1.2 FIFADataset................................ 37

5.2 Evaluation with Exact Sequential Pattern Mining . . . . . . . . . . . . . . . 37

5.2.1 CMS(interval) Dataset .......................... 37

5.2.2 CMS(date) Dataset ............................ 39

5.2.3 FIFA Dataset................................ 41

6 Conclusions ................................ 43

Bibliography ................................ 45

About this Honors Thesis

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
School
Department
Degree
Submission
Language
  • English
Research Field
Palavra-chave
Committee Chair / Thesis Advisor
Committee Members
Última modificação

Primary PDF

Supplemental Files