Exploring Invariance in Single and Multi-modal Deep Representation Learning

Lin, Rongmei (Spring 2022)

Permanent URL: https://etd.library.emory.edu/concern/etds/8c97kr69b?locale=en

Abstract

The recent successes in artificial intelligence have been largely attributed to the powerful and rich representations learned by deep neural networks. General-purpose representation learning has been well studied in the past decade. The ultimate goal of representation learning is to achieve a certain level of invariance. For example, in generic image recognition we aim to learn features that are sensitive only to image labels and invariant to intra-class variations such as background and object pose. We propose single-modal and multi-modal generalization to handle different scenarios. Single-modal generalization is the classic supervised-learning setting in which the training and testing data are drawn from the same distribution; most deep neural network architectures and generalization techniques are designed toward this end. In contrast, multi-modal generalization considers the problem where the data are drawn from different modalities, such as image/text pairs or multi-sensor healthcare data. These modalities are represented differently yet complement each other, so it is worth modeling the interactions between modalities rather than simply concatenating their information. My thesis focuses on learning task-driven invariant representations, and its contributions can be summarized as follows: 1) we introduce a unified regularizer for invariant representation learning that promotes the angular diversity of neurons; 2) we propose a framework that fuses multimodal data homogeneously and learns features that are invariant to the specific modality; 3) we further extend the multimodal framework with pre-training tasks on extensive vision-language and healthcare tasks, which leads to significant performance improvements.
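
As a rough illustration of the first contribution (promoting the angular diversity of neurons), the sketch below shows one common way such a regularizer can be realized in PyTorch: each neuron's weight vector is normalized onto the unit hypersphere, and a hyperspherical-energy term penalizes pairs of neurons that point in similar directions. This is a minimal sketch under stated assumptions, not the exact regularizer developed in Chapter 2; the function name angular_diversity_penalty, the Riesz-energy form, and the hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

def angular_diversity_penalty(weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Hyperspherical-energy-style penalty on a layer's weight matrix
    (one neuron per row). Lower values mean the neurons are spread more
    uniformly in angle on the unit hypersphere."""
    # Project each neuron's weight vector onto the unit hypersphere.
    w = weight / (weight.norm(dim=1, keepdim=True) + eps)
    # Pairwise Euclidean distances between the normalized neurons; on the
    # unit sphere this is a monotone function of the pairwise angle.
    dist = torch.cdist(w, w, p=2)
    n = w.size(0)
    # Average inverse distance over distinct pairs (a Riesz-energy term):
    # minimizing it pushes neurons apart, i.e., promotes angular diversity.
    off_diag = ~torch.eye(n, dtype=torch.bool, device=w.device)
    return (1.0 / (dist[off_diag] + eps)).mean()

# Usage sketch: add the penalty to the task loss for a layer of interest.
layer = nn.Linear(128, 64)
task_loss = torch.tensor(0.0)   # placeholder for the supervised task loss
lam = 1e-3                      # illustrative regularization weight
total_loss = task_loss + lam * angular_diversity_penalty(layer.weight)
```

In practice such a penalty would be summed over all regularized layers and traded off against the task loss via the regularization weight.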

Table of Contents

1 Introduction
  1.1 Background
  1.2 Related Works
    1.2.1 Generalizations
    1.2.2 Multi-modal Learning
  1.3 Research Contributions

2 Single-modal Generalization on Hypersphere
  2.1 Regularization on Hypersphere
    2.1.1 Overview
    2.1.2 Proposed Method
    2.1.3 Experiments and Results
  2.2 Orthogonal Training on Hypersphere
    2.2.1 Overview
    2.2.2 Proposed Method
    2.2.3 Experiment Results

3 Multi-modal Learning on Vision-and-Language Tasks
  3.1 Problem Definition and Challenges
    3.1.1 Overview
    3.1.2 Challenges
    3.1.3 Problem Definition
  3.2 Domain-Aware Attribute Extraction Model
    3.2.1 Proposed Method
    3.2.2 Experiment Results
    3.2.3 Conclusions

4 Multi-modal Learning with Pre-training Tasks on Healthcare Data
  4.1 Introduction
  4.2 Proposed Method
    4.2.1 Problem Formulation
    4.2.2 Multi-Sensor Fusion Framework
    4.2.3 Fully-Visible vs. Causal-Prefix Masking Pattern
  4.3 Multimodal Pre-training Tasks
    4.3.1 Masked Imputation on Each Modality (MIM)
    4.3.2 Contrastive Matching through Modality Replacement (MMR)
    4.3.3 Unsupervised Matching through Data Augmentation (MDA)
  4.4 Applications and Experiments
    4.4.1 Clinical Applications and Data Cohorts
    4.4.2 Experiment Settings
    4.4.3 Experiment Results
    4.4.4 Conclusions

5 Conclusions and Future Works

Bibliography

About this Dissertation

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
Language
  • English
