The Verbiverse: Creating a Verb Space with Comparative Methods of Distributional Semantics

Blodgett, Austin James (2015)

Permanent URL: https://etd.library.emory.edu/concern/etds/j38607456?locale=pt-BR
Published

Abstract

Computational semantics includes many of the unsolved problems of Natural Language Processing (NLP). The need for innovation in this field has motivated much research into word and language models that better represent meaning and semantic concepts. This thesis is concerned specifically with two tasks within this field: measuring verb similarity and verb clustering. The goal is to develop representations of verbs that can be used accurately and practically to judge semantic similarity between verbs and to group verbs into classes that reflect their relatedness in meaning. Verb clustering, the task of distributing verbs into semantically related classes, has been shown in previous research to have applications in multiple NLP tasks, including word sense disambiguation. This thesis presents and compares several methods for the automatic acquisition of verb similarity, with the goals of enabling future applications of these methods in NLP tasks and promoting discoveries about how the mechanisms modeled by these methods relate to linguistics.

We present several methods for verb clustering based on Latent Dirichlet Allocation (LDA), a probabilistic graphical model commonly used for topic modelling. We model verbs as collections of contextual features derived from latent classes. LDA, which is designed as a model for Bayesian inference of latent thematic categories, is well suited to modelling verb classes based on linguistic context. We demonstrate Recursive LDA, a procedure that executes LDA iteratively to produce a hierarchical structure of classes. We test several linguistic features drawn from the syntax and lexical arguments of verbs, with an interest in identifying how informative each feature is. We evaluate all of our experiments against human judgments of similarity, which provides a novel method for evaluating the semantic similarity metrics of word models. We test all of our data on a list of the 3,000 most common English verbs.

We test our method against Word2Vec, a popular and recently developed word model using skip-gram feature vectors refined by a neural network. The results in this thesis show that, given the right features, our method of using LDA with linguistic features outperforms Word2Vec's data-driven statistical approach when weighed against human judgments.
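The modelling idea in the abstract, verbs treated as collections of contextual features generated from latent classes, can be illustrated with a minimal sketch. It is not the thesis's implementation: the collapsed Gibbs sampler, the hyperparameters `alpha` and `beta`, the toy verbs, the feature names such as `dobj:food`, and the cosine comparison of class proportions are all assumptions made for illustration.

```python
import math
import random
from collections import defaultdict

def lda_gibbs(docs, num_classes, vocab_size, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative sketch).

    docs: dict mapping each verb to a list of context-feature tokens.
    Returns a dict mapping each verb to its latent-class proportions.
    """
    rng = random.Random(seed)
    doc_class = {v: [0] * num_classes for v in docs}             # feature tokens of verb v in class k
    class_feat = [defaultdict(int) for _ in range(num_classes)]  # count of feature f in class k
    class_total = [0] * num_classes                              # total tokens in class k
    assign = {}

    # Random initial class assignment for every feature token.
    for verb, feats in docs.items():
        assign[verb] = []
        for f in feats:
            k = rng.randrange(num_classes)
            assign[verb].append(k)
            doc_class[verb][k] += 1
            class_feat[k][f] += 1
            class_total[k] += 1

    for _ in range(iters):
        for verb, feats in docs.items():
            for i, f in enumerate(feats):
                # Remove the token's current assignment from the counts.
                k = assign[verb][i]
                doc_class[verb][k] -= 1
                class_feat[k][f] -= 1
                class_total[k] -= 1
                # Resample its class from the collapsed conditional distribution.
                weights = [
                    (doc_class[verb][j] + alpha)
                    * (class_feat[j][f] + beta) / (class_total[j] + vocab_size * beta)
                    for j in range(num_classes)
                ]
                k = rng.choices(range(num_classes), weights=weights)[0]
                assign[verb][i] = k
                doc_class[verb][k] += 1
                class_feat[k][f] += 1
                class_total[k] += 1

    # Smoothed per-verb class proportions.
    theta = {}
    for verb, counts in doc_class.items():
        total = sum(counts) + num_classes * alpha
        theta[verb] = [(c + alpha) / total for c in counts]
    return theta

def cosine(u, v):
    """Cosine similarity between two class-proportion vectors (an assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical toy data: each verb is a bag of invented context features.
toy_verbs = {
    "eat":    ["dobj:food", "nsubj:person", "dobj:food", "prep:with"],
    "devour": ["dobj:food", "nsubj:animal", "dobj:food"],
    "say":    ["ccomp", "nsubj:person", "ccomp", "prep:to"],
    "tell":   ["ccomp", "iobj:person", "nsubj:person", "ccomp"],
}
vocab = {f for feats in toy_verbs.values() for f in feats}
theta = lda_gibbs(toy_verbs, num_classes=2, vocab_size=len(vocab))

for verb, proportions in theta.items():
    print(verb, [round(p, 3) for p in proportions])
print("eat vs devour:", round(cosine(theta["eat"], theta["devour"]), 3))
```

On data this small the sampled classes are noisy, so the printed proportions only show the shape of the output: one distribution over latent classes per verb, which can then be compared across verbs.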

Table of Contents

1. Introduction

2. Related Work

2.1 Constructing Verb Classes

2.2 Word2Vec

2.3 Comparison to Our Approach

3. Linguistic Context and Meaning

3.1 Meaning Acquisition Process

3.2 Syntax

3.3 Lexical Arguments

4. Modelling Lexical Semantics

4.1 The Instance Space

4.2 Understanding Meaning

4.3 Properties of Words, Senses, and Categories

4.4 Context to Meaning

5. Latent Dirichlet Allocation (Beyond Topic Modelling)

5.1 A Probabilistic Graphical Model of Documents and Topics

5.2 LDA with Verbs and Context

5.3 Recursive LDA

6. Methodology

6.1 Corpus

6.2 List of Verbs

6.3 Parsing Approaches

6.4 Features

7. Experiments & Results

7.1 List of Experiments

7.2 Triad Evaluation Task

7.3 Results

8. Conclusions

9. Future Work

Bibliography

Appendix A (List of 3,000 Verbs)

Appendix B (Verb Space Images)

About this Honors Thesis

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
School
Department
Degree
Submission
Language
  • English
Research Field
Keyword
Committee Chair / Thesis Advisor
Committee Members
Last modified
