Beyond Audio: Advancing Speaker Diarization with Text-based Methodologies and Comprehensive Evaluation

Wu, Peilin (Spring 2024)

Permanent URL: https://etd.library.emory.edu/concern/etds/rb68xd387?locale=it

Published

Abstract

This thesis introduces a novel approach to Speaker Diarization (SD) that diverges from the traditional reliance on audio signals by exclusively leveraging text-based methodologies, together with comprehensive evaluation methods tailored to textual data. By employing the T5-3B model within both the Single Prediction Model (SPM) and Multiple Prediction Model (MPM) frameworks, and incorporating data processing pipelines designed to enhance the model's performance on transcripts generated by Automatic Speech Recognition (ASR) models, this study assesses the feasibility and effectiveness of text-based SD in distinguishing "who speaks what" across various two-speaker dialogues via sentence-level Speaker Change Detection and an aggregation mechanism. Furthermore, this research proposes and validates two new evaluation metrics: the Text-based Diarization Error Rate (TDER) and Diarization F1 (DF1). These metrics are specifically tailored to address the unique challenges of text-based SD and the joint assessment of ASR and SD errors. Alongside these metrics, we also propose a sequence alignment algorithm designed to align different transcripts effectively and efficiently, particularly in situations with overlapping speech.
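As a rough illustration of how sentence-level Speaker Change Detection can yield speaker labels in a two-speaker dialogue, the sketch below toggles between two speakers whenever a change is flagged. The function name `labels_from_changes`, the input format, and the toggle rule are assumptions made for illustration; they are not taken from the thesis, whose SPM and MPM aggregation may work differently.

```python
from typing import List


def labels_from_changes(change_flags: List[bool]) -> List[int]:
    """Turn per-boundary speaker-change flags into speaker labels (0 or 1).

    change_flags[i] is True when the speaker is predicted to change between
    sentence i and sentence i + 1. The first sentence is arbitrarily assigned
    to speaker 0 (an assumption of this sketch).
    """
    labels = [0]
    for changed in change_flags:
        # With only two speakers, a change always means "the other speaker".
        labels.append(1 - labels[-1] if changed else labels[-1])
    return labels


# Hypothetical example: four sentences, three boundaries between them.
sentences = ["Hi, how are you?", "Good, thanks.", "And you?", "Doing well."]
flags = [True, False, True]
for speaker, sentence in zip(labels_from_changes(flags), sentences):
    print(f"Speaker {speaker}: {sentence}")
```

The toggle rule only works because the dialogues are restricted to two speakers; with more speakers, a change flag alone would not identify who speaks next.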

Experiments conducted on a curated dataset, which encompasses 7 open-domain conversational contexts, demonstrate that text-based methods can perform comparably to traditional audio-based diarization systems and, notably, for short conversations under 15 minutes, even outperform them by 2.5% to 10%. The newly proposed text-based metrics, tested on the CallHome dataset through both manual inspection and error type analysis, show an enhanced ability to accurately assess the performance of text-based SD and joint ASR and SD systems in providing informative transcription results. Moreover, the proposed multiple sequence alignment algorithm achieves better alignment results (0.99 accuracy) compared to previous dynamic programming-based methods (0.92 accuracy). These findings not only challenge existing paradigms within the field of SD but also pave the way for further advancements in conversational analysis and AI, highlighting the untapped potential of textual information in SD tasks.
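For readers unfamiliar with dynamic programming-based transcript alignment, the following is a minimal sketch of classic pairwise global alignment (Needleman-Wunsch) over word tokens. It is a generic textbook technique shown for orientation only, not the multiple sequence alignment algorithm proposed in the thesis; the function name `align` and the scoring values (match +1, mismatch -1, gap -1) are assumptions.

```python
from typing import List, Optional, Tuple

Pair = Tuple[Optional[str], Optional[str]]


def align(ref: List[str], hyp: List[str],
          match: int = 1, mismatch: int = -1, gap: int = -1) -> List[Pair]:
    """Globally align two token lists; None marks a gap on that side."""
    n, m = len(ref), len(hyp)

    def sub(i: int, j: int) -> int:
        # Substitution score for aligning ref[i-1] with hyp[j-1].
        return match if ref[i - 1] == hyp[j - 1] else mismatch

    # Populate the scoring matrix.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Backtrack from the bottom-right corner to recover one optimal alignment.
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            pairs.append((ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((ref[i - 1], None))  # token missing from hyp
            i -= 1
        else:
            pairs.append((None, hyp[j - 1]))  # extra token in hyp
            j -= 1
    pairs.reverse()
    return pairs


# Hypothetical reference and ASR-style hypothesis with one dropped word.
print(align("how are you".split(), "how you".split()))
```

A pairwise alignment like this handles one reference sequence against one hypothesis; aligning a hypothesis against several speakers' reference sequences at once, as overlapping speech requires, needs a higher-dimensional formulation.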

Table of Contents

1 Introduction

1.1 Thesis and Research Questions

2 Background

2.1 Audio-based Systems

2.2 Utilization of Text-based Features

2.3 Speaker Change Detection

2.4 Metrics

2.5 Sequence Alignment Algorithm

3 Text-based Speaker Diarization

3.1 Task Overview

3.2 Single Prediction Model

3.3 Multiple Prediction Model

3.4 Data Processing

4 Text-based Speaker Diarization Evaluation

4.1 Text-based Metrics

4.1.1 Text-based Diarization Error Rate

4.1.2 Diarization F1

4.2 Aligning Transcripts

4.2.1 Current Alignment Limitations

4.2.2 Scoring Matrix Population

4.2.3 Backtracking

4.2.4 Optimization of Matrix Population

5 Experiments Setup

5.1 Datasets

5.2 Text-based SD Approach Experiments

5.2.1 Data Processing

5.2.2 Model

5.2.3 Model Evaluation

5.3 Text-based Metrics Experiments

5.3.1 Metrics Behavior

5.3.2 Alignment Efficacy Experiment Setup

5.3.3 Alignment Package: align4d

6 Results and Analysis

6.1 Model Performance

6.1.1 Conversational Length-based Analysis

6.1.2 Input Length-based Analysis

6.1.3 Text-based Error Types

6.2 Text-based Metrics

6.2.1 Metrics Behavior Analysis

6.2.2 Alignment Algorithm Analysis

7 Conclusion

Bibliography

About this Honors Thesis

Rights statement

Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.

School	Emory College
Department	Computer Science
Degree	B.S.
Submission	Honors Thesis
Language	English
Research Field	Computer Science
Keywords	Speaker Change Detection; Speaker Diarization; Turn Taking; Natural Language Processing
Committee Chair / Thesis Advisor	Jinho Choi, Emory University
Committee Members	Alissa Bans, Emory University; Davide Fossati, Emory University

Last modified

Primary PDF

Title	Date Uploaded
Beyond Audio: Advancing Speaker Diarization with Text-based Methodologies and Comprehensive Evaluation	2024-04-21 22:29:36 -0400
