Joint Text and Audio Multi-modal Speaker Diarization
Li, Mutian (Spring 2025)
Abstract
Speaker diarization is a speech processing task that aims to determine when each speaker talks and what each speaker says. Despite significant advances in deep learning, speaker diarization remains challenging, particularly in scenarios involving short utterances and noisy audio environments.
This thesis analyzes the frequent error types of audio-based speaker diarization models and proposes a novel approach that corrects these errors using GPT-4o. We then examine whether speech segment clustering is a feasible alternative. Finally, we introduce an approach that fine-tunes a speech recognition model, Whisper, for the speaker diarization task.
Our research begins with the Trauma Interview dataset collected by Emory Health, using annotations generated by the Azure tool. The speaker label error rate of the Azure annotations is estimated to be 7.35%. Open-source models perform even worse; for instance, the speaker label error rate of Pyannote Powerset on the Trauma Interview dataset is 9.36%. Leveraging GPT-4o for error correction reduces the estimated speaker label error rate to 4.26%. Although the proposed fine-tuned Whisper model does not outperform the baseline, it shows potential for improvement with further refinement.
Experiments are conducted not only on Trauma Interview data from our laboratory but also on publicly available datasets, including AMI and DailyTalk.
The findings of this research highlight the limitations of current speaker diarization research and point to directions for future work.
Table of Contents
1 Introduction
1.1 Motivation
1.2 Overview
1.3 Research Questions
1.4 Thesis Statement
2 Background
2.1 Related Works
2.1.1 Audio-based Speaker Diarization Models
2.1.2 Automatic Speech Recognition Models
2.1.3 Joint ASR and SD Models
2.1.4 Speaker Diarization Correction Models
2.1.5 Multi-modal Speaker Diarization Models
2.2 Evaluation Metrics
2.2.1 Diarization Error Rate
2.2.2 Word-level Diarization Error Rate
2.2.3 Text-based Diarization Error Rate
2.2.4 Diarization F1
2.3 Dataset
2.3.1 Trauma Interview
2.3.2 Other Datasets
3 Speaker Diarization Models and Error Analysis
3.1 Text-only Speaker Diarization Models
3.2 Audio-based Speaker Diarization Models
3.2.1 DiaPer
3.2.2 Pyannote Powerset
3.3 Post-processing
3.4 Error Analysis
3.5 Discussion
4 Speaker Diarization Error Correction and Multi-modal Model Development
4.1 Error Correction with ChatGPT
4.1.1 Motivation
4.1.2 Prompt
4.1.3 Post-processing
4.1.4 Evaluation Result
4.2 Multi-modal Model Development
4.2.1 Multi-modal Segmentation
4.2.2 Speech Segment Clustering
4.3 Discussion
4.3.1 Future Exploration of LLM Speaker Label Fixing
4.3.2 Speaker Change Detection to Improve Speech Segment Clustering
5 Fine-tuning an Automatic Speech Recognition Model for Speaker Diarization
5.1 Model
5.1.1 Speaker Diarization Embedding
5.1.2 Output Format
5.1.3 Sliding Window
5.2 Results and Discussion
5.2.1 Baseline Model
5.2.2 Evaluation Results
5.2.3 Discussion
6 Conclusion and Future Work
6.1 Conclusion
6.2 Future Directions
6.2.1 Speaker Diarization on Short Utterances
6.2.2 Whisper Fine-tuning for General Speaker Diarization Error Correction
6.2.3 Experiments on Whisper Fine-tuning for Two-Speaker Conversations
Bibliography