Evaluating Speaker Diarization in Transcripts: A Text-based Approach with the TDER Metric and the TranscribeView System

Gong, Chen (Spring 2023)

Permanent URL: https://etd.library.emory.edu/concern/etds/h415pb74w?locale=en

Abstract

Speaker Diarization (SD), the task of attributing speaker labels to dialogue segments, has traditionally been performed and evaluated at the audio level. The diarization error rate (DER) metric for SD systems measures errors in time but does not account for the impact of automatic speech recognition (ASR) on transcript-level performance. Word error rate (WER), the standard evaluation metric for ASR, counts only word insertions, deletions, and substitutions, disregarding SD quality. To better evaluate SD performance at the text level, this thesis proposes the Text-based Diarization Error Rate (TDER) and a diarization F1-score, which jointly assess SD and ASR performance.
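For reference, the two audio-level metrics discussed above have standard formulations in the literature (the notation below is the conventional one, not taken from this thesis):

```latex
% Word error rate: S substitutions, D deletions, I insertions,
% over N words in the reference transcript.
\mathrm{WER} = \frac{S + D + I}{N}

% Diarization error rate: durations of false-alarm speech, missed
% speech, and speaker confusion, over total reference speech time.
\mathrm{DER} = \frac{T_{\mathrm{FA}} + T_{\mathrm{miss}} + T_{\mathrm{conf}}}{T_{\mathrm{speech}}}
```

Written out this way, the gap is explicit: WER never inspects speaker labels, and DER never inspects the recognized words.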

To address inconsistencies in token counts between hypothesis and reference transcripts, we introduce a multiple sequence alignment tool that accurately maps words across reference and hypothesis transcripts. Our alignment method achieves 99% accuracy on a simulated corpus generated to reflect common SD and ASR errors.
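As background for the alignment method, below is a minimal Python sketch of the classic pairwise Needleman-Wunsch dynamic program over word tokens, which Chapter 4 extends to three and more sequences; the scoring constants and function name here are illustrative assumptions, not the thesis's implementation.

```python
# Minimal pairwise Needleman-Wunsch alignment over word tokens.
# Scoring values are illustrative, not the thesis's settings.

def needleman_wunsch(ref, hyp, match=1, mismatch=-1, gap=-1):
    n, m = len(ref), len(hyp)
    # score[i][j] = best alignment score of ref[:i] against hyp[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if ref[i - 1] == hyp[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align two words
                              score[i - 1][j] + gap,      # gap in hypothesis
                              score[i][j - 1] + gap)      # gap in reference
    # Trace back to recover the word-to-word mapping.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            pairs.append((ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((ref[i - 1], None))  # deletion
            i -= 1
        else:
            pairs.append((None, hyp[j - 1]))  # insertion
            j -= 1
    return list(reversed(pairs))

print(needleman_wunsch("i am okay".split(), "i am ok".split()))
# [('i', 'i'), ('am', 'am'), ('okay', 'ok')]
```

The quadratic table is what keeps the pairwise case cheap; exact alignment of k sequences generalizes this table to k dimensions, whose cost grows exponentially in k, which is why the 3-dimensional adaptation and the general multiple-sequence case are treated separately in Chapter 4.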

Comparisons with DER, WER, and WDER on 10 transcripts from the CallHome dataset demonstrate that TDER and the diarization F1-score provide a more reliable evaluation of speaker diarization at the text level. To enable a comprehensive evaluation of transcript quality, we present TranscribeView, a web-based platform for assessing and visualizing errors in speech recognition and speaker diarization. To the best of our knowledge, TranscribeView is the first platform that enables researchers to align multiple transcript sequences and to assess and visualize speaker diarization errors, contributing to the advancement of data-driven conversational AI research.
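The abstract does not spell out how TDER is computed (the formal definition appears in Chapter 3). Purely as an illustration of text-level diarization scoring over an existing word alignment, the hypothetical sketch below counts reference words whose aligned hypothesis word is missing or carries the wrong speaker; the function name, input format, and formula are assumptions for exposition, not the thesis's definition.

```python
# Hypothetical illustration of scoring speaker labels over aligned word
# pairs; NOT the thesis's TDER formula, which is defined in Chapter 3.
# Each aligned item: (ref_word, ref_speaker, hyp_word, hyp_speaker),
# where a None word marks an alignment gap (insertion or deletion).

def word_level_speaker_error(aligned):
    errors, scored = 0, 0
    for ref_word, ref_spk, hyp_word, hyp_spk in aligned:
        if ref_word is None:      # ASR insertion: no reference word to score
            continue
        scored += 1
        if hyp_word is None or hyp_spk != ref_spk:
            errors += 1           # deleted word, or word kept but misattributed
    return errors / scored if scored else 0.0

aligned = [
    ("hi",  "A", "hi",  "A"),    # correct word, correct speaker
    ("how", "B", "how", "A"),    # speaker confusion
    ("are", "B", None,  None),   # deletion
]
print(word_level_speaker_error(aligned))  # 2/3, about 0.667
```

Unlike time-based DER, a word-level score of this kind only becomes well defined once the alignment problem of the previous paragraph is solved, which is why the alignment tool and the metric are presented together.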

Table of Contents

1 Introduction
  1.1 Background and Motivation
  1.2 Thesis Objectives and Contributions
  1.3 Thesis Organization
2 Background
  2.1 ASR Evaluation Metrics
  2.2 Speaker Diarization Evaluation Metrics
    2.2.1 Diarization Error Rate (DER)
    2.2.2 Word-level Diarization Error Rate (WDER)
  2.3 Transcript Alignment Methods
3 Text-based Diarization Error Rate and F1 Score
  3.1 Text-based Diarization Error Rate (TDER)
  3.2 Diarization F1 Score
4 Multiple Sequence Alignment for Transcript Mapping
  4.1 Limitations of Pairwise Alignment Algorithms
  4.2 Needleman-Wunsch Algorithm
  4.3 Adaptation to Three Dimensions
  4.4 Multiple Sequence Alignment
5 Experiments and Results
  5.1 Transcribers
  5.2 CallHome Corpus
  5.3 Evaluation of Multiple Sequence Alignment
    5.3.1 Simulated Data
    5.3.2 Results
  5.4 Evaluation of Proposed Metrics
    5.4.1 Data Preparation
    5.4.2 Speaker Alignment
    5.4.3 Results
6 TranscribeView: A System for Transcript Evaluation and Diarization Error Visualization
  6.1 Interface
  6.2 Implementation
  6.3 Case Study: Comparing Transcribers
7 Conclusion
  7.1 Limitations
  7.2 Future Work
Bibliography

About this Honors Thesis

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
Language
  • English