Impact of Data Analysis on Nascent Natural Language Processing Tasks

Open Access

Zhao, Boxin (Spring 2024)

Permanent URL: https://etd.library.emory.edu/concern/etds/6m311q71c?locale=en
Published

Abstract

Understanding the importance of data analysis is essential for Natural Language Processing (NLP) research. While it is widely recognized that better data quality yields better model performance, little to no effort has been made to quantify the impact of data analysis in NLP research. For nascent NLP tasks, this judgment is even harder. This thesis studies how three types of deficient datasets (a noisy dataset, a falsely targeted dataset, and existing but unsuitable datasets) affect model performance, and the impact that data analysis can have on each of them. By fixing the noisy labels in a noisy dataset, we improved model performance from 69% to 75% with the model structure unchanged; by re-pointing the falsely targeted dataset to the application scenario, we produced a deployable version of the model; and by creating a new dataset spanning over 1,000 application scenarios, we trained a model that outperforms models trained on other datasets as well as zero-shot GPT. Our work shows that data analysis can have a significant impact on nascent NLP tasks across all kinds of NLP data.

Table of Contents

1. Introduction

1.1 Research Question

1.2 Thesis Statement

2. Background

2.1 Transformers

2.1.1 Bidirectional Encoder Representations from Transformer 

2.1.2 Robustly Optimized BERT Pretraining Approach 

2.1.3 Text-to-Text Transfer Transformer 

2.1.4 Generative Pretrained Transformer 

2.2 Resume Classification & Screening 

2.3 Dialogue State Generation 

3. Competence-Level Classification and Resume & Job Description Matching

3.1 Task Definition

3.2 Task Motivation

3.3 Approach

3.3.1 Resume Parsing and Field Concatenation 

3.3.2 Fixing False Labels 

3.3.3 Creating Job Description Dataset 

3.3.4 Encoding Resumes with Labels 

3.4 Experiments

3.4.1 Data Split

3.4.2 Modeling

3.4.3 Results

3.4.4 Code Packaging

3.5 Analysis

3.5.1 Confusion Matrix

3.5.2 Ablation Studies

3.5.3 Result Human Analysis

3.5.4 Limitation and Future Works

4. Dialogue State Generation

4.1 Task Definition

4.2 Task Motivation

4.2.1 Shaping Chat Bot Behavior

4.2.2 Dialogue State Tracking

4.2.3 Slot Discovery

4.3 Approach

4.3.1 DSG Zero-shot with GPT Pipeline

4.3.2 Developing New Dataset for Dialogue State Generation

4.3.3 Analysis on GPT-Generated Data

4.4 Experiments

4.4.1 Dataset

4.4.2 Modeling

4.4.3 Results and Evaluation

5. Conclusion

A Appendix

A.1 CRC1

A.2 CRC2

A.3 CRC3

A.4 CRC4

A.5 Glossary of Terms

B Appendix

B.1 Original Resume Example

B.2 Concatenated Input

C Appendix

Bibliography

About this Honors Thesis

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
Language
  • English
