Attention-enhanced Deep Learning Models for Data Cleaning and Integration
Zhang, Jing (Spring 2023)
Abstract
Data cleaning and integration are essential processes for ensuring the accuracy and consistency of data used in analytics and decision-making. Schema matching and entity matching are crucial tasks in this process, merging data from various sources into a single, unified view. Schema matching seeks to identify and resolve semantic differences between two or more database schemas, whereas entity matching seeks to detect records that refer to the same real-world entity across different data sources. Following recent deep learning trends, pre-trained transformers have been proposed to automate both the schema matching and entity matching processes. However, existing models use only the special token representation (e.g., [CLS]) to predict matches and ignore the rich, nuanced contextual information in the descriptions, yielding suboptimal matching performance. To improve performance, we propose the use of the attention mechanism to (1) learn schema matches between source and target schemas using the attribute name and description, (2) leverage the individual token representations to fully capture the information present in entity descriptions, and (3) jointly utilize attribute descriptions and entity descriptions to perform both schema and entity matching.
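The core idea of scoring matches from individual token representations rather than a single [CLS] vector can be illustrated with a minimal attention-over-attention sketch. This follows the standard AoA formulation (pairwise similarity, column-wise and row-wise softmax, then an averaged second-level attention); the function name and the use of raw dot-product similarity over numpy arrays are illustrative assumptions, not the dissertation's exact architecture.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_over_attention(src, tgt):
    """Attention-over-attention score over two token-embedding sequences.

    src: (m, d) token embeddings of the source attribute/entity description
    tgt: (n, d) token embeddings of the target attribute/entity description
    Returns a length-m attention distribution over source tokens, which a
    classifier could pool into a match/non-match prediction.
    """
    M = src @ tgt.T                        # (m, n) pairwise token similarities
    alpha = softmax(M, axis=0)             # target-to-source attention (per column)
    beta = softmax(M, axis=1).mean(axis=0) # source-to-target attention, averaged: (n,)
    return alpha @ beta                    # (m,) attended source-token weights
```

Because each column of `alpha` and the vector `beta` are probability distributions, the result is itself a distribution over source tokens, so every token's contribution to the match decision is retained instead of being collapsed into one summary vector.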
Table of Contents
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Research Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.1 Attention-over-Attention Deep Learning Schema Matching Model 4
1.2.2 Multi-Task Learning with Attention-over-Attention for Entity Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.3 Cross-Attention Multi-task Learning for Schema and Entity Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.4 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Background 8
2.1 Schema Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Entity Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.1 Single Task Deep Learning Models . . . . . . . . . . . . . . . 11
2.2.2 Multi-Task Deep Learning Models . . . . . . . . . . . . . . . . 12
2.3 Common Deep Learning Models used for Embedding Schema and Entities 13
2.3.1 Bidirectional LSTM Network . . . . . . . . . . . . . . . . . . 13
2.3.2 Transformers . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4 Attention Mechanisms . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4.1 Attention-over-Attention (AOA) . . . . . . . . . . . . . . . . . 15
2.4.2 Cross attention . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3 Attention-over-Attention Deep Learning Schema Matching Model 17
3.1 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.1.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . 19
3.1.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.1.3 Input Embedding . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.4 BiLSTM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.1.5 Attention-over-Attention . . . . . . . . . . . . . . . . . . . . . 22
3.1.6 Data Augmentation & Controlled Batch Sample Ratio . . . . 25
3.2 OMAP: A New Benchmark Dataset . . . . . . . . . . . . . . . . . . 25
3.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3.2 Baseline Models . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3.3 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . 28
3.4 Predictive Performance . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.5 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.6 Case study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.6.1 Correct prediction from all methods . . . . . . . . . . . . . . . 32
3.6.2 Correct prediction from only SMAT . . . . . . . . . . . . . . . . 33
3.6.3 Incorrect prediction from all models . . . . . . . . . . . . . . . 33
4 Multi-Task Learning with Attention-over-Attention for Entity Matching 35
4.1 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.1.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . 37
4.1.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.1.3 Entity Identifier Prediction . . . . . . . . . . . . . . . . . . . . 38
4.1.4 Attention-over-Attention for Entity Matching Prediction . . . 39
4.1.5 Dual Objective Training . . . . . . . . . . . . . . . . . . . . . 41
4.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2.2 Baseline Models . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.3 Predictive Performance . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.3.1 Auxiliary Tasks Analysis . . . . . . . . . . . . . . . . . . . . . 48
4.3.2 Statistics Analysis . . . . . . . . . . . . . . . . . . . . . . . . 48
4.4 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.5 Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5 Cross-Attention Multi-task Learning for Schema and Entity Matching 57
5.1 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.1.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . 60
5.1.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.1.3 Dual Objective Training . . . . . . . . . . . . . . . . . . . . . 64
5.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.2.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.2.2 Baseline models . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.3 Predictive Performance . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.4 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.5 Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6 Conclusion and Future Work 80
6.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Bibliography 83
Primary PDF: Attention-enhanced Deep Learning Models for Data Cleaning and Integration (uploaded 2023-04-03)