Simultaneous Dimensionality Reduction: A Data Efficient Approach for Multimodal Representations Learning

Abdelaleem, Eslam (Fall 2023)

Permanent URL: https://etd.library.emory.edu/concern/etds/th83m076j?locale=en

Abstract

Current experiments frequently produce high-dimensional, multimodal datasets, such as those combining neural activity with animal behavior or gene expression with phenotypic profiling, with the goal of extracting useful correlations between the modalities. Often, the first step in analyzing such datasets is dimensionality reduction. We explore two primary classes of approaches to dimensionality reduction (DR): Independent Dimensionality Reduction (IDR) and Simultaneous Dimensionality Reduction (SDR). In IDR methods, of which Principal Component Analysis (PCA) is a paradigmatic example, each modality is compressed independently, striving to retain as much variation within each modality as possible. In contrast, in SDR, one compresses the modalities simultaneously to maximize the covariation between the reduced descriptions, paying less attention to how much individual variation is preserved. Paradigmatic examples include Partial Least Squares (PLS) and Canonical Correlation Analysis (CCA). Even though these DR methods are a staple of statistics, their relative accuracy and dataset size requirements are poorly understood. We introduce a generative linear model that synthesizes multimodal data with known variance and covariance structures to examine these questions. We assess the accuracy with which the covariance structure is reconstructed as a function of the number of samples, the signal-to-noise ratio, and the number of varying and covarying signals in the data. Using numerical experiments, we demonstrate that linear SDR methods consistently outperform linear IDR methods and yield higher-quality, more succinct reduced-dimensional representations with smaller datasets. Remarkably, regularized CCA can identify low-dimensional weak covarying structures even when the number of samples is much smaller than the dimensionality of the data, a regime that is challenging for all dimensionality reduction methods. Our work corroborates and explains previous observations in the literature that SDR can be more effective in detecting covariation patterns in data. These findings suggest that SDR should be preferred to IDR in real-world data analysis when detecting covariation is more important than preserving variation.
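To make the setup concrete, below is a minimal Python sketch, not the thesis code, of the kind of experiment the abstract describes. All names, sizes, and parameter values are illustrative assumptions, and scikit-learn's PCA and CCA stand in for the IDR and SDR methods studied in the thesis. Two modalities X and Y share one latent signal, each carries one private (self) signal, and both are observed in noise; the sketch then asks how well the one-dimensional reduction from each method recovers the shared signal.

    # A minimal sketch of the generative model and IDR vs. SDR comparison.
    # Illustrative assumptions throughout; this is not the thesis code.
    import numpy as np
    from sklearn.cross_decomposition import CCA
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)

    n_samples, n_x, n_y = 200, 50, 50  # assumed sample count and ambient dimensions
    snr = 1.0                          # assumed signal-to-noise ratio knob

    # Latent signals: one shared between the modalities, one private to each.
    z_shared = rng.standard_normal((n_samples, 1))
    z_self_x = rng.standard_normal((n_samples, 1))
    z_self_y = rng.standard_normal((n_samples, 1))

    def mix(latent, n_out):
        """Project a latent signal into observed space with a random linear map."""
        return latent @ rng.standard_normal((latent.shape[1], n_out))

    # Observed modalities: linear mixtures of shared + self signals, plus noise.
    X = snr * (mix(z_shared, n_x) + mix(z_self_x, n_x)) + rng.standard_normal((n_samples, n_x))
    Y = snr * (mix(z_shared, n_y) + mix(z_self_y, n_y)) + rng.standard_normal((n_samples, n_y))

    # IDR: compress each modality on its own with PCA.
    zx_idr = PCA(n_components=1).fit_transform(X)
    zy_idr = PCA(n_components=1).fit_transform(Y)

    # SDR: compress the two modalities jointly with CCA.
    zx_sdr, zy_sdr = CCA(n_components=1).fit_transform(X, Y)

    def align(a, b):
        """Absolute correlation between two one-dimensional signals."""
        return abs(np.corrcoef(a.ravel(), b.ravel())[0, 1])

    print("PCA (IDR) vs shared signal:", align(zx_idr, z_shared), align(zy_idr, z_shared))
    print("CCA (SDR) vs shared signal:", align(zx_sdr, z_shared), align(zy_sdr, z_shared))

Because PCA only sees within-modality variance, its top component can mix the shared and self signals, whereas CCA targets cross-modality covariation directly. Note that scikit-learn does not provide a regularized CCA, so the undersampled regime (fewer samples than dimensions) that the abstract highlights for rCCA is not exercised in this sketch; in that regime plain CCA is known to overfit.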

Table of Contents

Introduction
Model
    Relations to Previous Work
    Linear Model with Self and Shared Signals
Methods
    Linear Dimensionality Reduction Methods
        PCA
        PLS
        CCA
        Regularized CCA (rCCA)
    Assessing Success and Sampling Noise Treatment
    Implementation
Results
    One self-signal in X and Y in addition to the shared signal (m_self = 1)
        Keeping 1 dimension after reduction (|Z_X/Y| = 1)
        Keeping 2 dimensions after reduction (|Z_X/Y| = 2)
    Many self-signals in X and Y in addition to the shared signal (m_self = 30)
        Keeping 1 dimension after reduction (|Z_X/Y| = 1)
        Keeping 30 dimensions after reduction (|Z_X/Y| = 30)
        Keeping 31 dimensions after reduction (|Z_X/Y| = 31)
    Key Parameters and Testing Technique for Dimensionality of Self and Shared Signals
Discussions
    Extensions and Generalizations
    Explaining Observations in the Literature
    Is SDR strictly effective in low-sampling situations?
    Diagnostic Test for the Number of Latent Signals
    Limitations and Future Work
        Linearity of the model
        Linearity of the methods
        Linearity of the metric
Conclusion
Bibliography

About this Master's Thesis

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
Language
  • English
