Framework for Automatic Generation of Large-scale Dialogue Data from Online Forums

Huryn, Daniil (Spring 2022)

Permanent URL: https://etd.library.emory.edu/concern/etds/3n204054m?locale=fr

Abstract

Unsupervised machine learning models have taken the Natural Language Processing world by storm. Transformers, currently the most popular unsupervised models, utilize vast amounts of data and deliver performance far beyond what could have been achieved only a few years ago. As good as these models are, they have one major requirement: a lot of data. One of the first transformers, BERT, was trained on 3.3 billion words, and later models such as GPT-3 have used even more. This presents unsupervised dialogue models with a problem: there is little high-quality dialogue data available, certainly not on the scale required. Because dialogue is far harder to find online than posts, articles, and similar text, high-quality datasets are usually very limited in size (Switchboard, DailyDialog), while high-quantity datasets (OpenSubtitles, the Reddit Corpus) are either of very low quality or of a very specific type, for instance movie subtitles. One of the main mitigations of this issue has been to first train models on large amounts of low-quality data and then fine-tune them on small amounts of high-quality data. In this paper, we propose a different solution: to create a high-quantity, medium-quality, multi-turn dataset that allows for far better model training. To do this, we take a computational approach to dialogue creation, building conversations from a set of Reddit posts and their respective comments and blending them so that a disjointed online forum thread becomes a new conversation. By utilizing the structure of Reddit threads and a variety of Natural Language Processing metrics, we first construct and then thoroughly filter conversations to automatically create a large dataset of high-quality dialogues.
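The NSP-based construction step referenced in the table of contents below can be illustrated with a small sketch: BERT's next-sentence-prediction head, accessed through the HuggingFace transformers library, scores each Reddit comment as a possible next turn, and the best-scoring comment is greedily appended to the conversation so far. This is a minimal sketch under assumed names and parameters (model checkpoint, score threshold, greedy selection loop), not the exact procedure used in the thesis.

# Minimal sketch (illustrative, not the thesis's exact pipeline): score
# Reddit comments as candidate next turns with BERT's next-sentence-
# prediction (NSP) head, then greedily chain the best-scoring comment
# onto the conversation. Checkpoint, threshold, and loop are assumptions.
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")
model.eval()


def nsp_score(context: str, candidate: str) -> float:
    """Probability that `candidate` is a natural continuation of `context`."""
    inputs = tokenizer(context, candidate, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    # Index 0 of the NSP logits corresponds to "candidate follows context".
    return torch.softmax(logits, dim=-1)[0, 0].item()


def build_dialogue(post: str, comments: list[str], max_turns: int = 6,
                   threshold: float = 0.7) -> list[str]:
    """Greedily assemble a multi-turn dialogue from a post and its comments."""
    dialogue = [post]
    remaining = list(comments)
    while remaining and len(dialogue) < max_turns:
        context = " ".join(dialogue)
        scored = [(nsp_score(context, c), c) for c in remaining]
        best_score, best = max(scored, key=lambda pair: pair[0])
        if best_score < threshold:
            break  # no remaining comment continues the conversation well enough
        dialogue.append(best)
        remaining.remove(best)
    return dialogue

In practice, a constructed dialogue would also be filtered with the additional metrics the thesis describes (for instance the BlenderBot- and GRADE-based scoring listed in the table of contents) before being kept in the dataset.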

Table of Contents

1 Introduction
  1.1 Dialogue Datasets
  1.2 Intentions
  1.3 Thesis Statement
2 Background
  2.1 Current Datasets
  2.2 Models
    2.2.1 BERT
    2.2.2 BlenderBot
    2.2.3 GRADE
3 Approach
  3.1 Data
  3.2 NSP Approach
    3.2.1 Initial Approach
    3.2.2 Early Improvements
    3.2.3 Beam Search
    3.2.4 BlenderBot
    3.2.5 GRADE
    3.2.6 Fine-Tuning BERT

About this Honors Thesis

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
Language
  • English
