Abstract
This thesis introduces a subtask of entity linking, called
character identification, that maps mentions in multiparty
conversation to their referent characters. Transcripts of TV shows
are collected as the sources of our corpus and automatically
annotated with mentions by linguistically-motivated rules. These
mentions are manually linked to their referents and disambiguated
with abstract referent labels through crowdsourcing. Our annotated
corpus comprises 448 scenes from 46 episodes across 2 seasons of the
TV show Friends, and achieves an inter-annotator agreement of
κ = 79.96. For statistical modeling, the task is reformulated as
coreference resolution, and two state-of-the-art systems are
evaluated on our corpus. A novel mention-to-mention ranking model is
proposed to provide better mention and mention-pair
representations learned from feature groupings of dialogue-specific
features. After linking coreferent clusters to their referent entities
with our proposed rule-based remapping algorithm, the best model
gives a purity score of 57.27% on average, which is promising given
the challenging nature of this task and our corpus.
Table of Contents
1 Introduction
1.1 Task definition
1.2 Motivation
1.3 Objectives
2 Background
2.1 Entity linking
2.2 Speaker identification
2.3 Conversation Corpora
2.4 Neural network
2.4.1 Word2Vec and Word Embeddings
2.4.2 Convolutional Neural Network
2.5 Coreference resolution
2.5.1 Evaluation Metrics
2.5.2 Stanford Sieve System
2.5.3 Stanford Neural System
2.5.4 Harvard Neural System
3 Corpus
3.1 Corpus creation
3.1.1 Data collection
3.1.2 Mention detection
3.1.3 Corpus annotation
3.1.4 Corpus adjudication
3.1.5 Corpus disambiguation
3.2 Corpus analysis
3.2.1 Annotation results
3.2.2 Mention Detection Error
3.2.3 Annotation Disagreement
4 Approaches
4.1 Data Formulation
4.1.1 Data Split
4.1.2 Data formats
4.2 Coreference resolution
4.2.1 CNN mention-to-mention model
4.3 Character identification
4.3.1 Rule-based entity linker
5 Experiments
5.1 Task analysis
5.1.1 Task feasibility
5.1.2 Rule-based vs. statistical model
5.1.3 Episode-delim vs. scene-delim documents
5.1.4 Learning past vs. future conversations
5.2 Coreference resolution
5.2.1 Stanford neural system
5.2.2 Harvard neural system
5.2.3 CNN mention-to-mention ranking model
5.3 Character identification
5.3.1 Linking rules
5.3.2 Cluster remapping
6 Conclusion
About this Honors Thesis
Rights statement
- Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.