Categorical Evaluation for Advanced Distributional Semantic Models

Kilgore, Andrew Reid (2016)


Distributional semantic word representations allow Natural Language Processing systems to extract and model an immense amount of information about a language. This technique maps words into a high-dimensional continuous space using a single-layer neural network, and it has enabled advances across many Natural Language Processing research areas and tasks. These representation models are evaluated with analogy tests: questions of the form "If a is to a', then b is to what?" are answered by composing multiple word vectors and searching the vector space. During neural network training, each word is examined as a member of its context; generally, a word's context is taken to be the elements adjacent to it within a sentence. While some work has examined the effect of expanding this definition, very little exploration has been done in this area. Further, no inquiry has been conducted into the specific linguistic competencies of these models, or into whether modifying their contexts changes the information they extract. In this paper we propose a thorough analysis of the lexical and grammatical competencies of distributional semantic models. We leverage analogy tests to evaluate the most advanced distributional model across 14 types of linguistic relationships. With this information we can then investigate whether modifying the training context produces differences in quality across any of these categories. Ideally, we will identify training approaches that increase precision in specific linguistic categories, which will allow us to investigate whether these improvements can be combined, by joining the information used in different training approaches, into a single, improved model.
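The vector-offset method behind these analogy tests can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the four three-dimensional embeddings are invented toy vectors (real models such as word2vec learn hundreds of dimensions from a large corpus), and the answer is found by adding the offset a' - a to b and taking the nearest remaining word by cosine similarity.

```python
import math

# Toy embedding table: hypothetical vectors chosen for illustration only.
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.2, 0.1],
    "man":   [0.1, 0.8, 0.3],
    "woman": [0.1, 0.2, 0.3],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def solve_analogy(a, a_prime, b, table):
    """Answer "a is to a_prime as b is to ?" via the offset b - a + a_prime."""
    target = [tb - ta + tp for ta, tp, tb in zip(table[a], table[a_prime], table[b])]
    # Exclude the three query words, then return the nearest neighbor
    # to the target point by cosine similarity.
    candidates = {w: v for w, v in table.items() if w not in {a, a_prime, b}}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

print(solve_analogy("man", "woman", "king", embeddings))  # -> queen
```

In a real evaluation the search runs over the full vocabulary, so the quality of the answer depends on how well the training contexts shaped the geometry of the space, which is exactly what the categorical tests in this thesis probe.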

Table of Contents

1 Introduction

1.1 Thesis Statement

2 Background

2.1 Word Representation

2.1.1 Word Embeddings

2.2 Analogy Tests

2.2.1 Vector Offset

2.2.2 Analogies

2.2.3 Syntactic Test Set

2.2.4 Test Set Customization

2.2.5 Scoring - Precision and Recall

2.3 Linguistic Structure

2.3.1 Dependency Structure

2.3.2 Predicate Argument Structure

2.3.3 Morphemes

2.4 Neural Network Models

2.4.1 NNLM

2.4.2 RNNLM

2.4.3 Word2Vec

2.4.4 Contexts

3 Approach

3.1 Corpus

3.2 Syntactic Contexts

3.2.1 First Order Dependency (dep1)

3.2.2 Semantic Role Label Head (srl1)

3.2.3 Closest Dependency Siblings (sib1)

3.2.4 First and Second Closest Dependency Siblings

3.3 Composite Models

3.3.1 All Siblings (allsib)

3.3.2 Second Order Dependency (dep2)

3.3.3 Second Order Dependency with Head (dep2h)

3.3.4 Siblings with Dependents

3.4 Ensemble Models

3.4.1 Model Inclusion

3.4.2 Categorical Model Selection

3.5 Analogy Testing

3.5.1 Scoring

3.6 Implementation

3.6.1 Arbitrary and Dependency Contexts

3.6.2 Analogy Testing Framework

3.6.3 Ensemble Models

4 Experiments

4.1 EmoryNLP Word2Vec

4.2 Lexical Evaluation

4.3 Grammatical Evaluation

4.3.1 Rank Scoring

4.4 Context Analysis

4.5 Ensemble Models

4.5.1 Diminishing Information

5 Conclusion

5.1 Future Work

5.1.1 Additional Models

5.1.2 Ensemble Models

5.1.3 Vector Space Analysis

Appendix A - Glossary

Appendix B - Syntactic Context Comparisons

About this Honors Thesis

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.