Leveraging Large Language Models for Loneliness Detection and Analysis Open Access

Kim, Michelle (Fall 2025)

Permanent URL: https://etd.library.emory.edu/concern/etds/jd472x736?locale=en
Published

Abstract

This research investigates how Large Language Models (LLMs) can be applied to measure and analyze loneliness in caregiver and non-caregiver populations. The goal is to enable the construction of diverse social media datasets for studying loneliness across the two populations and to better understand their experiences of loneliness.

First, this research applies GPT-4o, GPT-5-nano, and GPT-5 to evaluate Reddit posts from 15 subreddits and detect those of high quality. We introduce an expert-developed framework for measuring loneliness and an expert-informed typology of causes of loneliness for identifying and categorizing causes across populations. The complete data processing pipeline is validated against human annotation. It judges a given post’s relevance, measures the author’s loneliness, extracts and categorizes the author’s cause of loneliness, and extracts demographic information.
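The four pipeline stages listed above can be pictured as a small staged function. The following is only an illustrative sketch, not the thesis's implementation: the `judge` callable stands in for an LLM call, and the task names, the answer formats, and the early exit for irrelevant posts are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical interface: (task name, post text) -> raw model answer.
Judge = Callable[[str, str], str]


@dataclass
class PostRecord:
    text: str
    relevant: bool = False
    loneliness_score: Optional[int] = None
    causes: list = field(default_factory=list)
    demographics: dict = field(default_factory=dict)


def process_post(text: str, judge: Judge) -> PostRecord:
    """Run the four stages in order; skip later stages for irrelevant posts."""
    record = PostRecord(text=text)

    # Stage 1: relevance (assumed to be a yes/no answer).
    record.relevant = judge("relevance", text).strip().lower() == "yes"
    if not record.relevant:
        return record

    # Stage 2: loneliness measurement (assumed to be an integer score).
    record.loneliness_score = int(judge("loneliness", text))

    # Stage 3: cause extraction and categorization (assumed comma-separated).
    answer = judge("causes", text)
    record.causes = [c.strip() for c in answer.split(",") if c.strip()]

    # Stage 4: demographics (assumed "key=value; key=value" format).
    for pair in judge("demographics", text).split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            record.demographics[key.strip()] = value.strip()
    return record


def fake_judge(task: str, text: str) -> str:
    """Canned stand-in for a model, used only to exercise the sketch."""
    if task == "relevance":
        return "yes" if "lonely" in text.lower() else "no"
    canned = {
        "loneliness": "3",
        "causes": "caregiving role, social isolation",
        "demographics": "age=52; relationship=daughter",
    }
    return canned[task]


if __name__ == "__main__":
    print(process_post("I feel so lonely caring for my mom.", fake_judge))
```

Passing the model as a callable keeps the stage logic testable without network access. A real pipeline would swap `fake_judge` for a function that sends the appropriate prompt to the chosen model.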

We find that LLMs can measure loneliness through a psychologically grounded framework in the caregiver and non-caregiver populations, achieving 76.09% and 79.78% average accuracy, respectively. LLMs can also effectively apply the cause of loneliness categorization framework to high-quality Reddit posts, achieving micro-F1 scores of 0.825 and 0.8 in the caregiver and non-caregiver populations, respectively. The distribution of cause categories differs strongly across the two populations, suggesting that our dataset and framework capture differences between them. In particular, the perceived causes of loneliness differ markedly: caregivers’ loneliness predominantly originates from their role as caregivers, demonstrating that the loneliness experiences of the two populations are distinct. By applying these validated frameworks, we created a dataset of high-quality posts for both populations. Through demographic data extraction, we find that Reddit data is viable for building a diverse dataset across 6 demographic categories in the caregiver population. This work contributes to understanding caregiver and non-caregiver loneliness by establishing an LLM-based data processing pipeline for sourcing high-quality, diverse social media data and by demonstrating the successful application of LLMs to analyze differences in loneliness between the two populations.
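For readers unfamiliar with the metric, micro-F1 pools true positives, false positives, and false negatives over all posts and categories before computing a single F1 score. The sketch below shows that calculation on toy data. It assumes each post may carry more than one cause label, which this page does not confirm, and the category names are placeholders rather than the thesis's typology.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over per-post label sets.

    gold, pred: equal-length sequences of sets of category labels.
    Counts are pooled across all posts before computing F1.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # labels predicted and present in gold
        fp += len(p - g)   # labels predicted but not in gold
        fn += len(g - p)   # gold labels that were missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Placeholder labels, for illustration only.
gold = [
    {"caregiving_role"},
    {"social_isolation", "bereavement"},
    {"relationship_loss"},
]
pred = [
    {"caregiving_role"},
    {"social_isolation"},
    {"relationship_loss", "bereavement"},
]

if __name__ == "__main__":
    # Pooled counts: tp=3, fp=1, fn=1, so F1 = 6 / 8.
    print(micro_f1(gold, pred))  # 0.75
```

Because counts are pooled, frequent categories weigh more heavily in micro-F1 than rare ones. That is worth keeping in mind when category distributions differ between populations.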

Table of Contents

1 Introduction
1.1 Motivation
1.2 Research Questions and Hypotheses
1.3 Research Contributions
1.4 Thesis Statement

2 Background
2.1 Defining Loneliness
2.2 Measuring Loneliness
2.3 Categorizing Loneliness
2.4 Understanding Caregiver Loneliness
2.5 NLP Approaches to Analyzing Loneliness
2.6 Prior Works

3 Approach
3.1 Data Collection
3.1.1 Cross-population Contamination
3.2 Data Preparation
3.3 Model Selection
3.3.1 GPT-4o
3.3.2 GPT-5-nano
3.3.3 GPT-5
3.4 Annotation Procedure
3.5 Loneliness Evaluation Framework
3.5.1 Measuring Loneliness and Post Quality
3.6 Causes of Loneliness Evaluation Framework
3.6.1 Metrics for Cause Categorization
3.7 Preprocessed Dataset
3.7.1 Caregiver Subreddits
3.7.2 Non-caregiver Subreddits

4 Results
4.1 Relevance
4.1.1 Model Comparison for Relevance
4.1.2 Application to Dataset
4.2 Evaluating Loneliness
4.2.1 Caregiver Population Performance
4.2.2 Non-caregiver Population Performance
4.2.3 Application to Full Dataset
4.3 Categorizing Causes of Loneliness
4.3.1 Caregiver Population Performance
4.3.2 Non-caregiver Population Performance
4.3.3 Full Dataset Results
4.4 Demographic Data Extraction

5 Analysis
5.1 Loneliness Evaluation
5.1.1 Prompting Strategy
5.1.2 Error Analysis
5.1.3 Implications
5.2 Categorizing Causes of Loneliness
5.2.1 Prompting Strategy
5.2.2 Error Analysis
5.2.3 Implications of Differences in the Distribution of Types of Causes of Loneliness
5.3 Demographic Analysis
5.3.1 Implications

6 Conclusion
6.1 Future Directions
6.2 Conclusion

A Appendix
A.1 Loneliness Evaluation Framework
A.2 Cause Categorization Annotation Guidelines
A.3 Relevance Prompt
A.4 Loneliness Evaluation Prompt
A.5 Cause Categorization Prompt
A.6 Loneliness Score Distribution
A.6.1 Caregiver Subreddits
A.6.2 Non-caregiver Subreddits
A.7 Caregiver Subreddits Demographics
A.7.1 Caregiver Age
A.7.2 Caregiving Duration
A.7.3 Caregiver Gender
A.7.4 Caregiver Relationship with Patient, per Patient
A.7.5 Patient Age
A.7.6 Caregiver Category by Patient Diagnosis
A.7.7 GitHub Repository

Bibliography

About this Honors Thesis

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
Language
  • English