Attention-enhanced Deep Learning Models for Data Cleaning and Integration

Open Access

Zhang, Jing (Spring 2023)

Permanent URL: https://etd.library.emory.edu/concern/etds/xp68kh56x?locale=en
Published

Abstract

Data cleaning and integration is an essential process for ensuring the accuracy and consistency of data used in analytics and decision-making. Schema matching and entity matching are crucial parts of this process, which merges data from various sources into a single, unified view. Schema matching seeks to identify and resolve semantic differences between two or more database schemas, whereas entity matching seeks to detect records that refer to the same real-world entity across different data sources. Following recent trends in deep learning, pre-trained transformers have been proposed to automate both schema matching and entity matching. However, existing models use only the representation of a special token (e.g., [CLS]) to predict matches and ignore the rich and nuanced contextual information in the descriptions, thereby yielding suboptimal matching performance. To improve performance, we propose using attention mechanisms to (1) learn the schema matches between source and target schemas using the attribute name and description, (2) leverage the individual token representations to fully capture the information present in the descriptions of the entities, and (3) jointly utilize the attribute descriptions and entity descriptions to perform both schema and entity matching.
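The idea of weighting individual token representations, rather than relying on a single special-token vector, can be illustrated with a minimal sketch of the general attention-over-attention computation. This is an illustration under stated assumptions, not the dissertation's implementation: the function names, the plain dot-product scoring, and the toy vectors standing in for encoder outputs are all hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_over_attention(a_tokens, b_tokens):
    """Return one weight per token in a_tokens (the weights sum to 1).

    a_tokens, b_tokens: lists of equal-length float vectors standing in
    for contextual token embeddings of two descriptions (hypothetical
    inputs; a real model would take them from an encoder).
    """
    n, m = len(a_tokens), len(b_tokens)
    # Pairwise matching scores: scores[i][j] = dot(a_i, b_j).
    scores = [[sum(x * y for x, y in zip(a, b)) for b in b_tokens]
              for a in a_tokens]
    # Column-wise softmax: for each token in b, attention over tokens in a.
    cols = [softmax([scores[i][j] for i in range(n)]) for j in range(m)]
    # Row-wise softmax: for each token in a, attention over tokens in b,
    # then averaged over rows to estimate the importance of each b token.
    rows = [softmax(scores[i]) for i in range(n)]
    b_importance = [sum(rows[i][j] for i in range(n)) / n for j in range(m)]
    # Attention over attention: mix the column attentions by b importance.
    return [sum(cols[j][i] * b_importance[j] for j in range(m))
            for i in range(n)]


def pooled_representation(a_tokens, b_tokens):
    """Weighted sum of a_tokens using the attention-over-attention weights."""
    weights = attention_over_attention(a_tokens, b_tokens)
    dim = len(a_tokens[0])
    return [sum(w * tok[k] for w, tok in zip(weights, a_tokens))
            for k in range(dim)]
```

In this sketch every token of the first description contributes to the pooled vector in proportion to how strongly it aligns with the second description, which is the kind of contextual signal a single [CLS]-style vector may not expose directly.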

Table of Contents

1 Introduction

1.1 Motivation

1.2 Research Contributions

1.2.1 Attention-over-Attention Deep Learning Schema Matching Model

1.2.2 Multi-Task Learning with Attention-over-Attention for Entity Matching

1.2.3 Cross-Attention Multi-task Learning for Schema and Entity Matching

1.2.4 Organization

2 Background

2.1 Schema Matching

2.2 Entity Matching

2.2.1 Single Task Deep Learning Models

2.2.2 Multi-Task Deep Learning Models

2.3 Common Deep Learning Models used for Embedding Schema and Entities

2.3.1 Bidirectional LSTM Network

2.3.2 Transformers

2.4 Attention Mechanisms

2.4.1 Attention-over-Attention (AOA)

2.4.2 Cross attention

3 Attention-over-Attention Deep Learning Schema Matching Model

3.1 Approach

3.1.1 Problem Formulation

3.1.2 Overview

3.1.3 Input Embedding

3.1.4 BiLSTM

3.1.5 Attention-over-Attention

3.1.6 Data Augmentation & Controlled Batch Sample Ratio

3.2 OMAP: A New Benchmark Dataset

3.3 Experiments

3.3.1 Datasets

3.3.2 Baseline Models

3.3.3 Experimental Setup

3.4 Predictive Performance

3.5 Ablation Study

3.6 Case study

3.6.1 Correct prediction from all methods

3.6.2 Correct prediction from only SMAT

3.6.3 Incorrect prediction from all models

4 Multi-Task Learning with Attention-over-Attention for Entity Matching

4.1 Approach

4.1.1 Problem Definition

4.1.2 Overview

4.1.3 Entity Identifier Prediction

4.1.4 Attention-over-Attention for Entity Matching Prediction

4.1.5 Dual Objective Training

4.2 Experiments

4.2.1 Datasets

4.2.2 Baseline Models

4.3 Predictive Performance

4.3.1 Auxiliary Tasks Analysis

4.3.2 Statistics Analysis

4.4 Ablation Study

4.5 Case Study

5 Cross-Attention Multi-task Learning for Schema and Entity Matching

5.1 Approach

5.1.1 Problem Definition

5.1.2 Overview

5.1.3 Dual Objective Training

5.2 Experiments

5.2.1 Datasets

5.2.2 Baseline models

5.3 Predictive Performance

5.4 Ablation Study

5.5 Case Study

6 Conclusion and Future Work

6.1 Conclusion

6.2 Future Work

Bibliography

About this Dissertation

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
Language
  • English
