Two Essays on Content Engineering with Unstructured Data: Business Insights from User-Generated Content Open Access

Ko, Eun Hee (Spring 2019)

Permanent URL: https://etd.library.emory.edu/concern/etds/f7623d59h?locale=en

Published

Abstract

The essays in my dissertation are motivated by the challenges that marketing practitioners face in a market environment where consumer behavior is changing quickly as platforms expand into new media native to computers and mobile devices, a shift that has prompted continuous growth in marketing expenditures. While a wide range of research studies user-generated content (UGC) and its impact on marketing or consumer purchasing behavior, few studies examine content characteristics using large-scale field data. Moreover, most existing empirical research on the semantics of UGC pays limited attention to content beyond the text. To fill this gap, during my Ph.D. program I initiated and advanced several projects that investigate content features derived not only from text but also from images. In doing so, I apply a variety of methodological approaches (natural language processing, machine learning, and image processing) to merged public and proprietary datasets, both longitudinal and cross-sectional.

The first essay of my dissertation examines consumer engagement, measured as the number of likes and comments tied to a brand-themed social media post on Instagram. I study consumer engagement with brand-themed user-generated content (image-based social media posts tagged with #brandname), an increasingly common way for consumers to engage with brands. I describe consumer engagement using characteristics of the image and the text of a post (visual sentiment, visual complexity, text sentiment, and text complexity), which I craft using deep convolutional neural networks (Deep CNNs), a computer vision application programming interface (API), and natural language processing (NLP). Using data from over 86,000 Instagram posts collectively hashtagged with 86 product brand names, I find that visual sentiment and text sentiment are positively associated with consumer engagement. The effects of visual complexity and text complexity on consumer engagement are positive at low and moderate levels and turn negative at high levels: too much information, whether from the image or from the text, attenuates engagement. Around the middle of the range of visual complexity there is an optimal level that makes a post rich and engaging.

The second essay of my dissertation investigates the factors that characterize manipulated reviews, concentrating on unstructured text data and on brand strength as a factor associated with the incidence of suspicious online reviews. Studying over 270,000 Amazon.com reviews from 16 product categories, I find that approximately 3% of reviews are ones that consumers would regard as suspicious. Extreme emotions (e.g., fear, joy) explain why a review is viewed as suspicious better than mixed emotions (e.g., anticipation, surprise) or low-arousal emotions (e.g., sadness). I argue that weaker brands have an incentive to manipulate reviews. I find that weak brand status, described by lower advertising effort, is associated with suspicious reviews that are promotional (positive) in nature. The effect fades, however, for suspicious reviews that are denigrating (negative).
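The inverted-U relationship between complexity and engagement described in the abstract can be illustrated with a small sketch. The abstract does not state the dissertation's model specification, so the quadratic least-squares form below, the function name `fit_inverted_u`, and the synthetic data are all assumptions made only for illustration; they are not the author's method or data. Under that assumed form, a negative coefficient on the squared term implies an interior optimum at -b1 / (2 * b2).

```python
import numpy as np


def fit_inverted_u(complexity, engagement):
    """Fit engagement = b0 + b1*c + b2*c**2 by ordinary least squares.

    Returns the coefficients (b0, b1, b2) and the turning point
    -b1 / (2*b2) when the curve is concave (b2 < 0), else None.
    This quadratic form is an illustrative assumption, not the
    specification used in the dissertation.
    """
    c = np.asarray(complexity, dtype=float)
    y = np.asarray(engagement, dtype=float)
    # Design matrix with intercept, linear, and squared terms.
    X = np.column_stack([np.ones_like(c), c, c ** 2])
    (b0, b1, b2), *_ = np.linalg.lstsq(X, y, rcond=None)
    peak = -b1 / (2.0 * b2) if b2 < 0 else None
    return (b0, b1, b2), peak


# Synthetic data (hypothetical): engagement peaks at complexity = 0.5.
rng = np.random.default_rng(0)
complexity = rng.uniform(0.0, 1.0, size=500)
engagement = (1.0 + 4.0 * complexity - 4.0 * complexity ** 2
              + rng.normal(0.0, 0.1, size=500))

coefs, peak = fit_inverted_u(complexity, engagement)
print("coefficients:", coefs)
print("estimated optimal complexity:", peak)
```

Because engagement in the essay is measured as counts of likes and comments, a count model may be more appropriate in practice; the least-squares fit here is chosen only to keep the turning-point arithmetic visible.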

Table of Contents

Chapter 1. Overview

1.1. Introduction

1.2. User-Generated Content

1.3. Artificial Intelligence in Marketing

1.4. Agenda of the Dissertation

Chapter 2. Content Engineering of Images: The Effect of Sentiment and Complexity on Consumer Engagement with Brand-Themed User-Generated Content

2.1. Introduction

2.2. Background

2.2.1. Consumer Engagement in Social Media

2.2.2. Content Marketing

2.2.3. Machine Learning and Social Media

2.3. Conceptual Framework

2.3.1. Characteristics of the Post

2.3.2. Brand Characteristics

2.3.3. User Characteristics

2.4. Data

2.4.1. Raw Data and Sample Selection Criteria

2.4.2. Variable Crafting of User Post Data

2.4.2.1. Deep CNNs for Visual Sentiment: DeepSentiBank

2.4.2.2. Computer Vision API, NLP, and Clustering for Visual Complexity and Object Types

2.4.2.3. Text Variables

2.4.3. Descriptive Statistics

2.5. Empirical Analysis

2.5.1. Results

2.5.1.1. Consumer Engagement and Image Content

2.5.1.2. Consumer Engagement and Text Content

2.5.1.3. Consumer Engagement and Brand Characteristics

2.5.1.4. Consumer Engagement and User Characteristics

2.5.2. Simulation

2.5.3. Robustness Check

2.5.4. Accounting for Commercial Posts and Multiple Brands

2.5.4.1. Data Cleaning Process

2.5.4.2. Descriptive Statistics

2.5.4.3. Empirical Strategy and Analysis Results

2.5.4.4. Robustness Check

2.6. Managerial Implications and Conclusions

Chapter 3. Suspicious Online Product Reviews and Brand Advertising Effort

3.1. Introduction

3.2. Background

3.3. Create the Analysis Sample

3.3.1. Initial Processing

3.3.2. Merging Advertising Expenditure Data

3.3.3. Final Preprocessing

3.4. Classify and Label Reviews as Suspicious (or Not)

3.4.1. Selecting a Training Set for Use by Human Evaluators

3.4.2. Coding the Training Set as Suspicious (or Not) Using Human Evaluators

3.4.3. Coding the Full Dataset as Suspicious (or Not) Using Semi-Supervised Classification

3.4.4. Results from Semi-Supervised Classification and Comparison with Supervised Classifiers

3.5. Characterize Suspicious Reviews Using Semantic Features

3.5.1. Semantic Characteristics of Suspicious Reviews

3.5.2. Robustness Check with Holdout Sample

3.5.2.1. Diagnostic Analysis

3.5.2.2. Predicted Power: Accuracy and ROC Curve

3.6. Predicted Modeling with Alternative Machine Learning Classifiers

3.7. Explore Word2vec Model

3.7.1. Labelling Texts

3.7.2. Fine-Tuning Learned Word Embedding from Word2vec

3.7.3. Naïve Bayes

3.7.4. Random Forest

3.7.5. Result

3.7.6. Conclusions

3.8. Merge Suspicious Reviews and Brand Advertising Effort

3.8.1. Determining the Cutoff

3.8.2. Validation and Implementation of RD Design

3.8.3. Robustness

3.9. Beta Regression Model and Category Effect

3.10. Conclusions

Chapter 4. Conclusions

Bibliography

Appendix 1. Brand List

Appendix 2. Instructions for Students (Instagram Data)

Appendix 3. Objects Detected

Appendix 4. Robustness Check

Appendix 5. Spam Review Detection Algorithms

Appendix 6. Instructions Given to Human Evaluators

About this Dissertation

Rights statement

Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.

School	Laney Graduate School
Department	Philosophy
Degree	Ph.D.
Submission	Dissertation
Language	English
Research Field	Engineering, Computer; Business Administration, Marketing
Keyword	Unstructured Data; User-Generated Content; Artificial Intelligence
Committee Chair / Thesis Advisor	Douglas Bowman, Emory University
Committee Members	Zhongjian Lin, Emory University; Daniel McCarthy, Emory University; Diego Klabjan, Northwestern University

Primary PDF

Title	Date Uploaded
Two Essays on Content Engineering with Unstructured Data: Business Insights from User-Generated Content	2019-04-25 10:37:05 -0400
