Framework for Automatic Generation of Large-scale Dialogue Data from Online Forums

Huryn, Daniil (Spring 2022)

Permanent URL: https://etd.library.emory.edu/concern/etds/3n204054m?locale=fr

Abstract

Unsupervised machine learning models have taken the Natural Language Processing world by storm. Transformers, currently the most popular unsupervised models, utilize vast amounts of data and deliver performance far beyond what could have been achieved only a few years ago. As good as these models are, they have one major requirement: a lot of data. One of the first transformers, BERT, was trained on 3.3 billion words, and later models such as GPT-3 have used even more. This presents unsupervised dialogue models with a problem: there is little high-quality dialogue data available, certainly not on the scale required. Because dialogue is far harder to find online than posts, articles, and similar text, high-quality datasets are usually very limited in size (Switchboard, DailyDialog), while high-quantity datasets (OpenSubtitles, the Reddit Corpus) are either of very low quality or of a very specific type, for instance movie subtitles. One of the main mitigations of this issue has been to first train models on large amounts of low-quality data and then fine-tune them on small amounts of high-quality data. In this paper, we propose a different solution: to create a high-quantity, medium-quality, multi-turn dataset that allows for far better model training. To do this, we take a computational approach to dialogue creation, building conversations from a set of Reddit posts and their respective comments and blending them so that a disjointed online forum thread becomes a new conversation. By utilizing the structure of Reddit threads and a variety of Natural Language Processing metrics, we first construct and then thoroughly filter conversations to automatically create a large dataset of high-quality dialogues.
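The NSP-based construction step referenced in the table of contents below can be illustrated with a small sketch: BERT's next-sentence-prediction head, accessed through the HuggingFace transformers library, scores each Reddit comment as a possible next turn, and the best-scoring comment is greedily appended to the conversation so far. This is a minimal sketch under assumed names and parameters (model checkpoint, score threshold, greedy selection loop), not the exact procedure used in the thesis.

# Minimal sketch (illustrative, not the thesis's exact pipeline): score
# Reddit comments as candidate next turns with BERT's next-sentence-
# prediction (NSP) head, then greedily chain the best-scoring comment
# onto the conversation. Checkpoint, threshold, and loop are assumptions.
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")
model.eval()


def nsp_score(context: str, candidate: str) -> float:
    """Probability that `candidate` is a natural continuation of `context`."""
    inputs = tokenizer(context, candidate, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    # Index 0 of the NSP logits corresponds to "candidate follows context".
    return torch.softmax(logits, dim=-1)[0, 0].item()


def build_dialogue(post: str, comments: list[str], max_turns: int = 6,
                   threshold: float = 0.7) -> list[str]:
    """Greedily assemble a multi-turn dialogue from a post and its comments."""
    dialogue = [post]
    remaining = list(comments)
    while remaining and len(dialogue) < max_turns:
        context = " ".join(dialogue)
        scored = [(nsp_score(context, c), c) for c in remaining]
        best_score, best = max(scored, key=lambda pair: pair[0])
        if best_score < threshold:
            break  # no remaining comment continues the conversation well enough
        dialogue.append(best)
        remaining.remove(best)
    return dialogue

In practice, a constructed dialogue would also be filtered with the additional metrics the thesis describes (for instance the BlenderBot- and GRADE-based scoring listed in the table of contents) before being kept in the dataset.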

Table of Contents

1 Introduction
  1.1 Dialogue Datasets
  1.2 Intentions
  1.3 Thesis Statement
2 Background
  2.1 Current Datasets
  2.2 Models
    2.2.1 BERT
    2.2.2 BlenderBot
    2.2.3 GRADE
3 Approach
  3.1 Data
  3.2 NSP Approach
    3.2.1 Initial Approach
    3.2.2 Early Improvements
    3.2.3 Beam Search
    3.2.4 BlenderBot
    3.2.5 GRADE
    3.2.6 Fine-Tuning BERT

About this Honors Thesis

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
Language
  • English
