Detectability, Interpretability, and the Limits of Machine Learning in High-Dimensional Physical Systems

Swain, Arabind (Spring 2025)

Permanent URL: https://etd.library.emory.edu/concern/etds/sn00b022v?locale=en

Abstract

In recent years, large-dimensional datasets have become increasingly common in physics, arising from simulations and experiments that capture complex systems across space and time. These datasets offer new opportunities for discovery but also pose significant challenges in separating meaningful physical structure from irrelevant correlations and statistical noise. This dissertation investigates the use of machine learning (ML) methods to uncover underlying physical structure in high-dimensional systems, focusing on two central challenges: interpreting ML predictions in complex glassy systems, and developing a theoretical foundation for assessing the statistical significance of correlations between large datasets under individual marginal-covariance, joint-covariance, and cross-covariance analyses. In the context of glassy dynamics, where traditional approaches struggle due to the absence of clear structural order, ML classifiers such as Support Vector Machines (SVMs) have been shown to accurately predict local rearrangements of particles. However, using simple toy models and simulations, this work demonstrates that commonly used indicators, such as high classification accuracy, apparent Arrhenius scaling, or distance to the decision hyperplane, are not sufficient to guarantee that the ML model has captured the relevant physical quantity, in this case the size of the energy barriers. This raises important questions about the inverse problem: under what conditions can interpretable physics be extracted from statistical learning models? To address broader issues of signal detection in high-dimensional data, the dissertation extends the well-known Marchenko-Pastur (MP) distribution from covariance to cross-covariance matrices. An exact analytical expression is derived for the distribution of singular values arising purely from noise-noise correlations, providing a null model for detecting shared structure between two large datasets.
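The dissertation's cross-covariance null model is derived analytically in Chapter 3; as background, the classical MP null for an ordinary sample covariance matrix can be checked numerically. The sketch below (assuming numpy; the dimensions p and n are illustrative) builds a pure-noise sample covariance matrix and compares its eigenvalue range against the MP support edges:

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions: p variables, n samples; aspect ratio gamma = p / n.
p, n = 200, 1000
gamma = p / n

# Pure-noise data matrix with i.i.d. unit-variance entries.
X = rng.standard_normal((p, n))

# Sample covariance matrix and its eigenvalues.
C = X @ X.T / n
eigvals = np.linalg.eigvalsh(C)

# Marchenko-Pastur support for unit-variance noise: as p, n -> infinity
# at fixed gamma, all eigenvalues fall inside
# [(1 - sqrt(gamma))^2, (1 + sqrt(gamma))^2].
lam_minus = (1 - np.sqrt(gamma)) ** 2
lam_plus = (1 + np.sqrt(gamma)) ** 2

print(f"empirical range: [{eigvals.min():.3f}, {eigvals.max():.3f}]")
print(f"MP support:      [{lam_minus:.3f}, {lam_plus:.3f}]")
```

Any eigenvalue well outside this bulk would signal structure beyond noise; the dissertation's contribution is the analogous null support for the singular values of a cross-covariance matrix between two datasets, which does not follow from the MP law itself.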
Furthermore, the work establishes a BBP-type (Baik-Ben Arous-Péché) detectability phase transition for cross-covariance and joint-covariance matrices, identifying critical thresholds at which rank-1 signals become distinguishable from noise, and showing that joint- and cross-covariance methods can detect weaker signals, or do so with fewer samples, than analysis based on individual marginal covariances alone. Altogether, this dissertation provides both conceptual insight and analytical tools for understanding when ML models truly learn the physics of a system, and how noise, dimensionality, and sample size fundamentally constrain that process.
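For the single-dataset (marginal covariance) case, the BBP picture is classical and can be illustrated directly: a rank-1 spike of strength beta escapes the noise bulk only when beta exceeds sqrt(gamma), with gamma = p/n, in which case the top eigenvalue detaches to (1 + beta)(1 + gamma/beta). A minimal numerical sketch (numpy assumed; the corresponding thresholds for joint- and cross-covariance matrices are the dissertation's own results and are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 200, 1000
gamma = p / n                       # aspect ratio
edge = (1 + np.sqrt(gamma)) ** 2    # MP bulk edge for pure noise

def top_eigenvalue(beta):
    """Top sample-covariance eigenvalue for a rank-1 spike of strength beta.

    Population covariance is I + beta * v v^T with v the first basis
    vector, realized by scaling the first row of the noise matrix.
    """
    X = rng.standard_normal((p, n))
    X[0] *= np.sqrt(1 + beta)       # inject the spike along v = e_1
    C = X @ X.T / n
    return np.linalg.eigvalsh(C)[-1]

weak = top_eigenvalue(0.2)    # below the threshold sqrt(gamma) ~ 0.447
strong = top_eigenvalue(2.0)  # above the threshold
pred = (1 + 2.0) * (1 + gamma / 2.0)  # BBP prediction above threshold

print(f"weak spike:   {weak:.3f} (bulk edge {edge:.3f})")
print(f"strong spike: {strong:.3f} (BBP prediction {pred:.3f})")
```

Below the threshold the top eigenvalue stays pinned at the bulk edge and the spike is undetectable from the spectrum; above it, the outlier tracks the BBP formula.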

Table of Contents

1 Introduction

2 Machine learning that predicts well may not learn the correct physical descriptions of glassy systems

3 Distribution of singular values in large sample cross-covariance matrices

4 Statistical properties of spiked joint covariance and cross covariance matrices

5 Discussion

Bibliography

About this Dissertation

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
Language
  • English
