Generative Argument Mining: Pretrained Language Models are Argumentative Text Parsers Open Access
Roytburg, Daniel (Spring 2025)
Abstract
Argument mining is a natural language processing task that imposes a rhetorical structure schema on raw text, assigning labels to argumentative sub-phrases and connecting the identified sub-phrases with relations. Such labels may describe functional components of complete arguments like claims and premises, stylistic elements such as testimonies or facts, or some other defined schema. Argument mining belongs to the structure prediction task family, using the formal definitions of entity and relation extraction to label specific decompositions of rhetorical structure. The task has important implications, not only for applied use cases in areas such as social media analytics, jurisprudence, and group decision-making, but also for improving general structure prediction methods, given the unique constraints imposed by the problem.
This thesis draws on the evolving capabilities of pretrained language models to cast argument mining as a generative task. Classical argument mining approaches use discriminative classifiers that produce a distribution of predictions for each individual token or sub-phrase in an input; this requires significant task-specific architecture to process the outputs of autoencoder language models. We consider whether task-agnostic generative language models can use a structured annotation scheme to mimic classification without additional architectural decisions. To this end, we adapt such a scheme, which enables models to translate raw inputs into annotated text outputs, allowing efficient parsing and extraction of the necessary labels. This decision affords the flexibility not only to introduce generative argument mining systems but also to evaluate a wide variety of pretrained models, labeling schemas, training environments, and task configurations.
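As a rough illustration of how annotated text output can be parsed back into labels, the following is a minimal sketch. It assumes a TANL-style bracket notation of the form `[ span | label | relation = target ]`; the exact notation adopted in the thesis is not shown on this page, and the function name `parse_annotated`, the example sentence, and the labels `claim`, `premise`, and `supports` are hypothetical.

```python
import re

# Assumed TANL-style span: "[ text | label | relation = target | ... ]".
# Nested brackets are not handled in this sketch.
SPAN = re.compile(r"\[([^\[\]]*)\]")


def parse_annotated(output):
    """Return (plain_text, entities, relations) from an annotated string."""
    entities, relations = [], []
    for match in SPAN.finditer(output):
        fields = [f.strip() for f in match.group(1).split("|")]
        text = fields[0]
        label = fields[1] if len(fields) > 1 else None
        entities.append((text, label))
        # Any further fields are read as "relation = target" pairs.
        for field in fields[2:]:
            if "=" in field:
                rel, target = (p.strip() for p in field.split("=", 1))
                relations.append((text, rel, target))
    # Replace each bracketed span by its surface text to recover the input.
    plain = SPAN.sub(lambda m: m.group(1).split("|")[0].strip(), output)
    return plain, entities, relations


# Hypothetical model output for a one-sentence argument.
generated = (
    "[ Smoking should be banned | claim ] because "
    "[ it harms bystanders | premise | supports = Smoking should be banned ]."
)
plain, entities, relations = parse_annotated(generated)
```

Because the labels live inside the generated string, the same decoding loop works for any generative model; only the parsing step is task-specific.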
We explore the limits of these models across four key dimensions: labeling strategies for long-span entities, comparing full token spans, numerical identifiers, and abstractive summaries; encoder-decoder versus decoder-only architectures, contrasting their effectiveness in this structured prediction task; the necessity of fine-tuning for decoder-only models against few-shot in-context learning; and end-to-end extraction versus relation-only extraction, evaluating the impact of providing gold entity boundaries on relation identification. To assess model performance, we supplement traditional classification metrics with a set of criteria based on adherence to an augmented natural language output format, measuring reconstruction, entity, label, and format errors.
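The four error types named above could be checked along the following lines. This is a sketch under stated assumptions, not the thesis's definitions: it assumes the bracket notation `[ span | label | relation = target ]`, and it interprets a format error as malformed bracket syntax, a label error as a label or relation type outside the schema, an entity error as a relation target that matches no generated span, and a reconstruction error as output that no longer matches the input once annotations are stripped. The function name `compliance_errors` and the example strings are hypothetical.

```python
import re

# Assumed bracket notation; nested brackets are not handled in this sketch.
SPAN = re.compile(r"\[([^\[\]]*)\]")


def compliance_errors(source, output, labels, relation_types):
    """Count assumed compliance errors in one generated annotation."""
    errors = {"format": 0, "reconstruction": 0, "entity": 0, "label": 0}
    # Format: every opening bracket needs a closing bracket.
    if output.count("[") != output.count("]"):
        errors["format"] += 1
    spans = [[f.strip() for f in m.group(1).split("|")]
             for m in SPAN.finditer(output)]
    texts = {fields[0] for fields in spans}
    for fields in spans:
        if len(fields) < 2:  # a span without a label field
            errors["format"] += 1
            continue
        if fields[1] not in labels:  # label outside the schema
            errors["label"] += 1
        for field in fields[2:]:
            rel, sep, target = (p.strip() for p in field.partition("="))
            if not sep:  # relation field without "="
                errors["format"] += 1
            elif rel not in relation_types:  # relation type outside the schema
                errors["label"] += 1
            elif target not in texts:  # target is not a generated span
                errors["entity"] += 1
    # Reconstruction: stripping annotations should give back the input.
    plain = SPAN.sub(lambda m: m.group(1).split("|")[0].strip(), output)
    if " ".join(plain.split()) != " ".join(source.split()):
        errors["reconstruction"] += 1
    return errors


source = "Smoking should be banned because it harms bystanders."
bad = (
    "[ Smoking should be banned | opinion ] because "
    "[ it harms people | premise | supports = Smoking must go ]."
)
report = compliance_errors(source, bad, {"claim", "premise"}, {"supports", "attacks"})
```

In this example the label `opinion` is outside the schema, the relation target matches no span, and the rewritten premise breaks reconstruction, so three of the four counters are incremented.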
We find that generative models outperform current classification-based baselines by 10.41% for argumentative relations and 5.28% for argumentative components. Beyond this, our introduction of compliance metrics allows a granular view of the failure modes of generative models in this context, revealing that while accuracy can be high, compliance errors, particularly those involving entity coherence and label hallucination, remain significant challenges. Our exploration across model architectures suggests that while larger decoder-only models exhibit strong in-context learning capabilities, fine-tuned encoder-decoder models can achieve competitive or superior performance, especially when data is limited. Furthermore, our investigation into labeling strategies indicates that output length and parsing complexity trade off against accuracy, highlighting the need for more robust methods for representing long-span argumentative units. These findings offer insight into the application of generative language models to argument mining, outlining both their potential and the key areas requiring further research and development to realize fully end-to-end, high-fidelity argumentative structure prediction.
Table of Contents
1 Introduction
1.1 Generative Structure Prediction: where are we now?
1.2 Expanding the Scope: Beyond Conventional Benchmarks
1.2.1 The Four Dimensions
1.2.2 Compliance Metrics: Bridging the Gap
1.3 Thesis Structure
1.4 The Bitter Lesson Revisited: Scaling vs. Domain Expertise
2 Background
2.1 Introduction
2.2 Communication and Rhetorical Theory
2.2.1 Introduction
2.2.2 Aristotelian Logic
2.2.3 The Toulmin Model
2.2.4 Alternative Theorists and the Structured Language Problem
2.2.5 Conclusion
2.3 Argument Mining
2.3.1 Introduction
2.3.2 Subtasks of Argument Mining
2.3.3 Key Datasets
2.3.4 A Brief History
2.4 Structure Prediction
2.4.1 Introduction
2.4.2 Named Entity Recognition
2.4.3 Relation Extraction
2.4.4 Joint Entity-Relation Extraction
2.4.5 Challenges
2.4.6 Language Models for Joint Entity-Relation Extraction
2.5 Conclusion
3 Approach
3.1 Introduction
3.2 Problem Formulation
3.2.1 Joint Entity-Relation Extraction
3.2.2 Generation v. Classification
3.2.3 The TANL Framework
3.3 Key Questions
3.3.1 Argument Annotation Strategy
3.3.2 Encoder-Decoder v. Decoder-only
3.3.3 Few-Shot v. Fine-Tuning
3.3.4 End-to-End v. Relation-only
3.4 Evaluation Criteria
3.4.1 Compliance
3.4.2 Accuracy
3.5 Conclusion
4 Experiments
4.1 Introduction
4.2 Datasets
4.3 Baselines
4.4 Experimental Details
4.4.1 Argument Labeling Strategy
4.4.2 Encoder-Decoder v. Decoder-only
4.4.3 End-to-End v. Relation-only
4.4.4 Few-Shot v. Fine-Tuning
4.5 Implementation Details
4.5.1 Encoder-Decoder Models
4.5.2 Decoder-Only Models
4.6 Conclusion
4.6.1 Future Directions
5 Analysis
5.1 Introduction
5.2 Accuracy
5.2.1 CDCP Overview
5.2.2 AAEC Overview
5.2.3 Comparing Encoder-Decoder Models
5.2.4 Comparing Decoder-Only Models
5.2.5 Oracle Setting
5.3 Compliance
5.3.1 CDCP Dataset
5.3.2 AAEC Dataset
5.4 Discussion and Analysis
5.4.1 Prediction of Relations v. Accuracy
5.5 Conclusion
6 Conclusion
6.1 Across Four Dimensions
6.1.1 Research Goals
6.1.2 Key Findings
6.2 Implications for Argument Mining
6.3 Implications for Structure Prediction
6.4 Future Work
6.4.1 Encoder-Decoder v. Decoder-Only Models
6.4.2 Further Model Dimensions
6.5 Concluding Remarks
Bibliography
About this Honors Thesis
Primary PDF
| Title | Date Uploaded |
|---|---|
| Generative Argument Mining: Pretrained Language Models are Argumentative Text Parsers | 2025-04-18 17:56:50 -0400 |