Impact of Data Analysis on Nascent Natural Language Processing Tasks
Zhao, Boxin (Spring 2024)
Abstract
Understanding the importance of data analysis is essential for Natural Language Processing (NLP) research. While it is widely recognized that better data quality yields better model performance, little to no effort has been made to quantify the impact of data analysis in NLP research. For nascent NLP tasks, this judgment is even harder to make. This thesis studies how three types of inadequate datasets (a noisy dataset, a falsely targeted dataset, and an existing but unsuitable dataset) affect model performance, and the impact that data analysis can have on each of them. By fixing the noisy labels in a noisy dataset, we improved model performance from 69% to 75% with the model structure unchanged; by re-pointing the falsely targeted dataset to the application scenario, we produced a deployable version of the model; and by creating a new dataset spanning over 1,000 application scenarios, we trained a model that outperforms models trained on other datasets as well as zero-shot GPT. Our work shows that data analysis can have a significant impact on nascent NLP tasks for all kinds of NLP data.
Table of Contents
1. Introduction
1.1 Research Question
1.2 Thesis Statement
2. Background
2.1 Transformers
2.1.1 Bidirectional Encoder Representations from Transformer
2.1.2 Robustly Optimized BERT Pretraining Approach
2.1.3 Text-to-Text Transfer Transformer
2.1.4 Generative Pretrained Transformer
2.2 Resume Classification & Screening
2.3 Dialogue State Generation
3. Competence-Level Classification and Resume & Job Description Matching
3.1 Task Definition
3.2 Task Motivation
3.3 Approach
3.3.1 Resume Parsing and Field Concatenation
3.3.2 Fixing False Labels
3.3.3 Creating Job Description Dataset
3.3.4 Encoding Resumes with Labels
3.4 Experiments
3.4.1 Data Split
3.4.2 Modeling
3.4.3 Results
3.4.4 Code Packaging
3.5 Analysis
3.5.1 Confusion Matrix
3.5.2 Ablation Studies
3.5.3 Result Human Analysis
3.5.4 Limitation and Future Works
4. Dialogue State Generation
4.1 Task Definition
4.2 Task Motivation
4.2.1 Shaping Chat Bot Behavior
4.2.2 Dialogue State Tracking
4.2.3 Slot Discovery
4.3 Approach
4.3.1 DSG Zero-shot with GPT Pipeline
4.3.2 Developing New Dataset for Dialogue State Generation
4.3.3 Analysis on GPT-Generated Data
4.4 Experiments
4.4.1 Dataset
4.4.2 Modeling
4.4.3 Results and Evaluation
5. Conclusion
A Appendix
A.1 CRC1
A.2 CRC2
A.3 CRC3
A.4 CRC4
A.5 Glossary of Terms
B Appendix
B.1 Original Resume Example
B.2 Concatenated Input
C Appendix
Bibliography
About this Honors Thesis
School | |
---|---|
Department | |
Degree | |
Submission | |
Language | |
Research Field | |
Keyword | |
Committee Chair / Thesis Advisor | |
Committee Members | |
Primary PDF
Title | Date Uploaded |
---|---|
Impact of Data Analysis on Nascent Natural Language Processing Tasks | 2024-04-25 14:35:22 -0400 |