When Large Language Models Meet Religious Text

Restricted; Files Only

Choi, Jacob (Spring 2024)

Permanent URL: https://etd.library.emory.edu/concern/etds/j9602216f?locale=en
Published

Abstract

AI has been expanding rapidly beyond computer science into areas such as healthcare, transportation, and the humanities. The intersection of AI and religion is also growing, but little computational work has been done from an application-based perspective. Current research at this intersection often examines what the models have learned, such as religious bias. For work that affects communities more directly, commercial AI-powered tools are available to help users learn about religious texts, but they lack transparency, which may be alarming for some.

To contribute to AI applications in religion from a computational perspective beyond observing model bias, we perform a case study on the Bible: we build a verse extraction tool using deep learning techniques to showcase the process of creating such a tool for religious communities to use. We first explore a challenge common to those who study the Bible: finding references. We use semantic similarity search and the Hungarian algorithm to identify references, an approach we found infeasible yet impactful. We then introduce six datasets that we use to train a llama-2-7b-chat model to respond to user queries with Bible verses. Additionally, we create two test sets to evaluate models, the first asking fact-based questions and the second asking theological questions. We find that state-of-the-art commercial models still come out on top, with the highest accuracies of 62.5 and 58.5, and we describe next steps to encourage research toward application-based tools for religion in the computer science domain.
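As a rough illustration of the reference-finding idea mentioned above, the sketch below scores verse pairs by cosine similarity and then picks the one-to-one pairing with the largest total similarity. It is not the thesis's implementation: the two-dimensional vectors are made-up stand-ins for sentence embeddings, the function names `cosine` and `best_matching` are hypothetical, and an exhaustive search replaces the Hungarian algorithm (it reaches the same optimum, but is only practical for tiny inputs).

```python
from itertools import permutations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def best_matching(sim):
    """Maximum-weight one-to-one matching by exhaustive search.

    sim[i][j] is the similarity of row item i and column item j.
    Assumes there are at least as many columns as rows.
    Returns (total similarity, list mapping row index -> column index).
    """
    n_rows, n_cols = len(sim), len(sim[0])
    best_total, best_cols = float("-inf"), None
    for cols in permutations(range(n_cols), n_rows):
        total = sum(sim[i][j] for i, j in enumerate(cols))
        if total > best_total:
            best_total, best_cols = total, cols
    return best_total, list(best_cols)


# Toy "embeddings" (assumed values) for verses in two passages.
passage_a = [[1.0, 0.0], [0.0, 1.0]]
passage_b = [[0.1, 0.9], [0.9, 0.1], [0.7, 0.7]]

# Pairwise similarity matrix: rows are passage_a verses, columns are passage_b verses.
sim = [[cosine(u, v) for v in passage_b] for u in passage_a]
total, match = best_matching(sim)
print(match, round(total, 4))
```

For realistic input sizes, a polynomial-time solver such as the Hungarian algorithm would take the place of the permutation loop; the similarity matrix it consumes has the same shape.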

Table of Contents

1 Introduction

1.1 Background and Motivation

1.2 Research Objective

1.3 Thesis Organization

2 Background

2.1 Background and Trends in NLP for Religious Text Analysis

2.2 Exploring NLP Tasks in Religious Texts

3 Finding References: An Exploration

3.1 Semantic Similarity

3.2 Maximum Weight Matching

3.3 Takeaway

4 Model Training

4.1 What is a language model?

4.2 What is fine-tuning?

4.3 Model Choice

4.4 Training Pipeline

5 Datasets

5.1 Overview of Datasets

5.2 Bible Versions

5.3 Instruction Fine Tuning Format

5.4 Datasets

5.4.1 Dataset 1: Similarity

5.4.2 Dataset 2: Named Entity Recognition

5.4.3 Dataset 3: Version

5.4.4 Dataset 4: Situation

5.4.5 Dataset 5: Single

5.4.6 Dataset 6: References

5.4.7 Combined Dataset

6 Experiments and Results

6.1 Discussion

6.2 Further Exploration

7 Conclusion

7.1 Challenges and Limitations

7.2 Future Work

8 Final Remarks

A Appendix

Bibliography

About this Honors Thesis

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
Language
  • English