Dependency Analysis of Abstract Universal Structures in Korean and English

Open Access

Chun, Che Yeol (Spring 2018)

Permanent URL: https://etd.library.emory.edu/concern/etds/sj1391960?locale=pt-BR%2A
Published

Abstract

This thesis makes two contributions in the form of lexical resources: (1) to dependency parsing in Korean and (2) to semantic parsing in English. First, we describe our methodology for building three Korean dependency treebanks, derived from existing treebanks and pseudo-annotated according to the latest Universal Dependencies (UD) guidelines. The original Google Korean UD Treebank is re-tokenized to ensure morpheme-level annotation consistency with the other corpora while maintaining the linguistic validity of the revised tokens. Phrase structure trees in the Penn Korean Treebank and the Kaist Treebank are automatically converted into UD dependency trees by applying head-percolation rules and linguistically motivated heuristics. In total, more than 38K dependency trees are generated. To the best of our knowledge, this is the first time that these three Korean treebanks have been converted into UD dependency treebanks following the latest annotation guidelines. Second, we introduce an ongoing project to construct a new corpus of Deep Dependency Graphs (DDG), converted from the phrase structure trees in the OntoNotes corpus with additional semantic information found in the Proposition Bank (PropBank) and Abstract Meaning Representation (AMR). This dataset plays a pivotal role in our proposed AMR parsing scheme: it is first used to train a dependency parser, which is subsequently trained on the AMR parsing task through transfer learning. Since AMR inherits the core semantic roles of PropBank, we speculate that a first training phase that exposes the parsing model to the semantic role labeling task will greatly help the model perform AMR parsing. In this thesis, we address the preliminary step of integrating PropBank labels for predicate argument relations during the constituent-to-dependency conversion of OntoNotes. It is our hope that the new corpus, with the rich syntactic information stored in DDG and the semantic role information provided by PropBank, which fully describes the predicate argument structure, will serve as a useful resource for semantic role labeling.
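
The constituent-to-dependency conversion mentioned in the abstract can be illustrated with a minimal sketch of head percolation. This is not the thesis's converter: the rule table `HEAD_RULES`, the tag names, the `Node` class, and the function names are all hypothetical, and the sketch produces unlabeled arcs only (no UD relation labels, empty categories, or coordination handling). The idea is that each phrase picks one child as its head according to a rule, the head word percolates up from that child, and every other child's head word becomes a dependent of it.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    """A phrase-structure node; leaves carry a word and have no children."""
    label: str
    children: List["Node"] = field(default_factory=list)
    word: str = ""


# Hypothetical head rules: phrase label -> (scan direction, child labels by priority).
HEAD_RULES: Dict[str, Tuple[str, List[str]]] = {
    "S": ("left", ["VP"]),
    "VP": ("left", ["VB", "VP"]),
    "NP": ("right", ["NN", "NP"]),
}


def find_head_child(node: Node) -> Node:
    """Pick the head child of a phrase using the (assumed) rule table."""
    direction, priorities = HEAD_RULES.get(node.label, ("right", []))
    kids = node.children if direction == "left" else list(reversed(node.children))
    for wanted in priorities:
        for kid in kids:
            if kid.label == wanted:
                return kid
    return kids[0]  # fallback: first child in scan order


def to_dependencies(root: Node) -> List[Tuple[str, str]]:
    """Return unlabeled (dependent, head) word pairs; the top word attaches to ROOT.

    Words stand in for token indices here, so this toy version assumes
    that no word occurs twice in the sentence.
    """
    arcs: List[Tuple[str, str]] = []

    def lexical_head(node: Node) -> Node:
        if not node.children:
            return node
        head_child = find_head_child(node)
        head_leaf = lexical_head(head_child)
        for child in node.children:
            if child is not head_child:
                arcs.append((lexical_head(child).word, head_leaf.word))
        return head_leaf

    arcs.append((lexical_head(root).word, "ROOT"))
    return arcs


# Toy tree: (S (NP (NN dogs)) (VP (VB chase) (NP (NN cats))))
tree = Node("S", [
    Node("NP", [Node("NN", word="dogs")]),
    Node("VP", [Node("VB", word="chase"),
                Node("NP", [Node("NN", word="cats")])]),
])
arcs = to_dependencies(tree)
```

For the toy tree, the verb heads the sentence and both nouns attach to it. A real converter would operate on token indices rather than word strings and would add a relation label to each arc.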

Table of Contents

1 Introduction

1.1 Motivation

1.2 Objectives

2 Background

2.1 Natural Language Structures

2.1.1 Parts of Speech

2.1.2 Morphological Analysis

2.1.3 Phrase Structure

2.1.4 Dependency Structure

2.1.4.1 Universal Dependencies

2.1.4.2 Deep Dependency Graph

2.1.5 Predicate Argument Structure

2.1.5.1 PropBank

2.2 Constituent-to-Dependency Conversion

2.3 Abstract Meaning Representation Parsing 

3 Approach

3.1 Dependency Conversion in Korean

3.1.1 Google UD Korean Treebank

3.1.1.1 Morphological Analysis

3.1.1.2 Re-Tokenization

3.1.1.3 POS Re-Labeling

3.1.1.4 Head ID Re-Mapping

3.1.1.5 Dependency Re-Labeling

3.1.1.6 Lexical Correction

3.1.2 Penn Korean Treebank

3.1.2.1 Mapping Empty Categories

3.1.2.2 Coordination

3.1.2.3 POS Tags

3.1.2.4 Dependency Relations 

3.1.3 Kaist Treebank

3.1.3.1 Coordination

3.1.3.2 POS Tags

3.1.3.3 Dependency Relations

3.2 Dependency-AMR Parsing

3.2.1 Transfer Learning

3.2.2 PropBank Integration into OntoNotes DDG Corpus

3.2.2.1 Coordination

3.2.2.2 Copulas 

4 Analysis

4.1 Dependency Treebanks in Korean

4.1.1 Corpus Analytics

4.1.1.1 POS Analysis

4.1.1.2 Dependency Analysis

4.1.2 Remaining Issues

4.2 PropBank-Augmented OntoNotes DDG Corpus

4.2.1 Remaining Challenges

5 Conclusion 

About this Honors Thesis

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
Language
  • English