Framework for Automatic Generation of Large-scale Dialogue Data from Online Forums
Huryn, Daniil (Spring 2022)
Abstract
Unsupervised machine learning models have taken the Natural Language Processing world by storm. Transformers, currently the most popular unsupervised models, utilize vast amounts of data and deliver performance far beyond what could have been achieved only a few years ago. As good as these models are, they have one major requirement: a lot of data. One of the first transformers, BERT, was trained on 3.3 billion words, and later models such as GPT-3 have used even more. This presents unsupervised dialogue models with a problem: high-quality dialogue data is scarce, certainly at the scale required. Because dialogue is far harder to find online than posts, articles, and similar text, high-quality datasets are usually very limited in size (Switchboard, DailyDialog), while high-quantity datasets (OpenSubtitles, Reddit Corpus) are either of very low quality or of a very specific type, such as movie subtitles. One of the main mitigations of this issue has been to first train models on large amounts of low-quality data and then fine-tune them on small amounts of high-quality data. In this paper, we propose a different solution: to create a high-quantity, medium-quality, multi-turn dataset that allows for far better model training. To do this, we take a computational approach to dialogue creation, building conversations from a set of Reddit posts and their respective comments and blending them in a way that creates a new conversation out of a disjointed online forum thread. By utilizing the structure of Reddit threads and a variety of Natural Language Processing metrics, we first construct and then thoroughly filter conversations to automatically create a large dataset of high-quality dialogues.
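To make the chaining idea concrete, the sketch below shows one way such construction could look, assuming Hugging Face's transformers implementation of BERT's next-sentence-prediction (NSP) head as the turn-compatibility metric. The greedy selection loop, the `build_dialogue` helper, and the `min_score` threshold are illustrative assumptions for this sketch, not the thesis's exact procedure.

```python
# Minimal sketch (not the thesis code): chain Reddit replies into one
# dialogue by greedily picking, at each step, the reply that BERT's
# next-sentence-prediction head rates most likely to follow the context.
import torch
from transformers import BertForNextSentencePrediction, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")
model.eval()

def nsp_score(context: str, candidate: str) -> float:
    """Probability, per BERT's NSP head, that `candidate` follows `context`."""
    enc = tokenizer(context, candidate, return_tensors="pt",
                    truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**enc).logits
    # Label 0 of the NSP head means "sequence B continues sequence A".
    return torch.softmax(logits, dim=-1)[0, 0].item()

def build_dialogue(post: str, replies: list[str], min_score: float = 0.5):
    """Greedily chain forum replies into a multi-turn dialogue.

    `min_score` is an illustrative filtering threshold, not a thesis value.
    """
    dialogue, context, pool = [post], post, list(replies)
    while pool:
        scored = [(nsp_score(context, r), r) for r in pool]
        best_score, best = max(scored)
        if best_score < min_score:
            break  # no remaining reply plausibly continues the conversation
        dialogue.append(best)
        pool.remove(best)
        context = best  # judge the next turn against the latest utterance
    return dialogue
```

Per the table of contents, the thesis refines this kind of greedy NSP baseline with beam search, BlenderBot, GRADE, and a fine-tuned BERT (sections 3.2.3 through 3.2.6).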
Table of Contents
1 Introduction
1.1 Dialogue Datasets
1.2 Intentions
1.3 Thesis Statement
2 Background
2.1 Current Datasets
2.2 Models
2.2.1 BERT
2.2.2 BlenderBot
2.2.3 GRADE
3 Approach
3.1 Data
3.2 NSP Approach
3.2.1 Initial Approach
3.2.2 Early Improvements
3.2.3 Beam Search
3.2.4 BlenderBot
3.2.5 GRADE
3.2.6 Fine-Tuning BERT