Joint Text and Audio Multi-modal Speaker Diarization

Li, Mutian (Spring 2025)

Permanent URL: https://etd.library.emory.edu/concern/etds/gt54kp696?locale=en

Abstract

Speaker diarization is a speech processing task that aims to determine when each speaker talks and what each speaker says. Despite significant advances in deep learning, speaker diarization remains challenging, particularly in scenarios involving short utterances and noisy audio environments.

This thesis analyzes the frequent error types of audio-based speaker diarization models and proposes a novel approach that corrects these errors using GPT-4o. We then examine whether clustering speech segments is a feasible alternative. Finally, we introduce an approach that fine-tunes a speech recognition model, Whisper, for the speaker diarization task.
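To make the GPT-4o correction step concrete, the sketch below shows one way a diarized transcript could be passed to the model for relabeling. It is a minimal sketch only: it assumes the OpenAI Python SDK, and the prompt wording is illustrative, not the prompt developed in Section 4.1.2.

    # Minimal sketch of LLM-based speaker-label correction, assuming the
    # OpenAI Python SDK (pip install openai) and an OPENAI_API_KEY in the
    # environment. The prompt is illustrative, not the thesis's prompt.
    from openai import OpenAI

    client = OpenAI()

    def correct_speaker_labels(transcript: str) -> str:
        # Ask GPT-4o to relabel turns that were likely assigned to the
        # wrong speaker, without altering the transcribed words.
        response = client.chat.completions.create(
            model="gpt-4o",
            temperature=0,  # deterministic relabeling
            messages=[
                {"role": "system",
                 "content": ("You fix speaker labels in diarized interview "
                             "transcripts. Keep the words unchanged; only "
                             "relabel turns given to the wrong speaker.")},
                {"role": "user", "content": transcript},
            ],
        )
        return response.choices[0].message.content

    # Example: the second turn is plausibly the interviewee, not SPEAKER_1.
    print(correct_speaker_labels(
        "SPEAKER_1: How have you been sleeping?\n"
        "SPEAKER_1: Not well. I keep waking up at night.\n"
    ))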

Our research begins with the Trauma Interview dataset, collected by Emory Health and annotated with the Azure tool. The speaker label error rate of the Azure annotations is estimated at 7.35%. Open-source models perform even worse; for instance, Pyannote Powerset reaches a 9.36% speaker label error rate on the Trauma Interview dataset. Leveraging GPT-4o for error correction reduces the speaker label error rate to an estimated 4.26%. Although the proposed fine-tuned Whisper model does not outperform the baseline, it shows potential for improvement with further refinement.
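Assuming a word-level definition (the thesis's exact metric definitions appear in Section 2.2), a speaker label error rate of the kind quoted above can be written as

\[
\text{SLER} = \frac{N_{\text{wrong}}}{N_{\text{words}}} \times 100\%,
\]

where \(N_{\text{wrong}}\) is the number of transcribed words assigned to the wrong speaker and \(N_{\text{words}}\) is the total number of words in the reference transcript. Under this reading, GPT-4o correction reduces mislabeled words from roughly 7 in every 100 to roughly 4 in every 100.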

Experiments are conducted not only on the Trauma Interview dataset from our laboratory but also on publicly available datasets, including AMI and DailyTalk.

The findings of this research highlight the limitations of current speaker diarization systems and point to directions for future research.

Table of Contents

1 Introduction

1.1 Motivation

1.2 Overview

1.3 Research Questions

1.4 Thesis Statement

2 Background

2.1 Related Works

2.1.1 Audio-based Speaker Diarization Models

2.1.2 Automatic Speech Recognition Models

2.1.3 Joint ASR and SD Models

2.1.4 Speaker Diarization Correction Models

2.1.5 Multi-modal Speaker Diarization Models

2.2 Evaluation Metrics

2.2.1 Diarization Error Rate

2.2.2 Word-level Diarization Error Rate

2.2.3 Text-based Diarization Error Rate

2.2.4 Diarization F1

2.3 Dataset

2.3.1 Trauma Interview

2.3.2 Other Datasets

3 Speaker Diarization Models and Error Analysis

3.1 Text-only Speaker Diarization Models

3.2 Audio-based Speaker Diarization Models

3.2.1 DiaPer

3.2.2 Pyannote Powerset

3.3 Post-processing

3.4 Error Analysis

3.5 Discussion

4 Speaker Diarization Error Correction and Multi-modal Model Development

4.1 Error Correction with ChatGPT

4.1.1 Motivation

4.1.2 Prompt

4.1.3 Post-processing

4.1.4 Evaluation Result

4.2 Multi-modal Model Development

4.2.1 Multi-modal Segmentation

4.2.2 Speech Segment Clustering

4.3 Discussion

4.3.1 Future Exploration of LLM Speaker Label Fixing

4.3.2 Speaker Change Detection to Improve Speech Segment Clustering

5 Fine-tuning Automatic Speech Recognition Model for Speaker Diarization

5.1 Model

5.1.1 Speaker Diarization Embedding

5.1.2 Output Format

5.1.3 Sliding Window

5.2 Results and Discussion

5.2.1 Baseline Model

5.2.2 Evaluation Results

5.2.3 Discussion

6 Conclusion and Future Work

6.1 Conclusion

6.2 Future Directions

6.2.1 Speaker Diarization on Short Utterances

6.2.2 Whisper Fine-tuning for General Speaker Diarization Error Correction

6.2.3 Experiments on Whisper Fine-tuning for Two-Speaker Conversations

Bibliography

About this Master's Thesis

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
Language
  • English
