Harnessing networked, textual data with graph and language modeling Open Access

Ling, Chen (Fall 2024)

Permanent URL: https://etd.library.emory.edu/concern/etds/t148fj52q?locale=en
Published

Abstract

Networked, textual data are ubiquitous across domains and play a critical role in applications ranging from social network analysis to knowledge representation. These two types of data combine naturally: text data can be structured as a graph, and graph data can carry rich text on nodes and edges. For instance, a document collection can be represented as a graph where nodes are documents and edges indicate relationships such as citations or references. Conversely, social networks often contain textual information in the form of user profiles, posts, and interactions embedded on nodes and edges. Existing works have primarily focused on either graph or textual data in isolation, often overlooking the potential synergy between the two. Yet many research problems require a holistic understanding of both structural relationships and semantic content, which calls for combining the two modalities. Integrating them can leverage the complementary strengths of graph data's structural insights and textual data's semantic richness, leading to more robust and comprehensive data mining methodologies.

Despite the advancements in graph data mining and language modeling, existing approaches that treat graph and textual data separately can introduce significant limitations. While many graph neural networks do integrate semantic information from node and edge attributes, there remain graph data mining problems, such as those addressing graph inverse problems or combinatorial optimization, that predominantly rely on graph topology and information flow for making predictions or approximations. Conversely, language models that overlook the structural context provided by graphs may lack a framework for accurately interpreting and generating text, especially in tasks requiring a deep understanding of relationships and dependencies. This dichotomy motivates the need to bridge graph and textual data, leveraging their complementary strengths for more effective data mining.

There are three key challenges to addressing this integration. First, preserving both graph and textual information in a unified representation is challenging. This process involves creating embeddings that maintain the structural integrity of graphs while encapsulating the semantic richness of texts. Second, enhancing graph-based tasks with textual information is crucial. For instance, in source localization of information diffusion on information networks, incorporating node texts can provide additional context that improves the accuracy of identifying information diffusion sources. Third, utilizing graph structures to augment text-based tasks, such as knowledge-intensive question answering, is also essential. For example, knowledge graphs can provide a structured context that enhances the reasoning capabilities of language models. This dissertation is dedicated to exploring these challenges, focusing in particular on two applications: 1) integrating textual data to solve graph data mining problems such as source localization of information diffusion, and 2) employing graph-structured data to improve the reasoning capability of language models.

Specifically, this dissertation focuses on three primary areas: 1) learning latent embeddings that fuse both semantic and structural information of the observed network to facilitate downstream graph data mining tasks, e.g., deep graph generation, source localization, and influence maximization; 2) enhancing natural language understanding tasks by different means, such as quantifying the uncertainty of a large language model's response and employing an external knowledge base to enhance the commonsense reasoning task in a retrieval-augmented manner; and 3) creating a framework that combines the semantic processing capabilities of large language models with the structural analysis strengths of graph neural networks to learn a unified representation for text-attributed graphs.

This dissertation's contributions include novel formulations and frameworks for each task, the creation of new datasets, and extensive experimental validation. This interdisciplinary approach advances the theoretical understanding of integrating graph data mining and language modeling and has practical implications for a wide range of applications in data science.

Table of Contents

1) Introduction

2) Enhancing Graph Data Mining by Exploiting Semantic Information on Networks

3) Integrating Structured Knowledge and Quantifying Uncertainty in Natural Language Understanding

4) Representation Learning of Textual-edge Graphs for Link Prediction

5) Conclusion and Future Works

About this Dissertation

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
School
Department
Degree
Submission
Language
  • English
Research Field
Keyword
Committee Chair / Thesis Advisor
Committee Members
Last modified
