Low Resource RAG: From Slide Data Processing to RAG Systems (Public)

Chung, Andrew (Spring 2025)

Permanent URL: https://etd.library.emory.edu/concern/etds/kd17cv244?locale=zh
Published

Abstract

To advance the understanding and development of successful retrieval-augmented generation systems, we examine various components to identify essential elements and potential performance improvements across different methodologies. Through collaboration with Hyundai, we develop a low-resource domain retrieval-augmented generation system designed to answer questions about automotive safety collision tests using information from multimodal slides. Our approach introduces a novel, language model-centric data processing pipeline that effectively transforms slide information into textual content suitable for retrieval and answer generation. We evaluate the performance of different state-of-the-art retrieval-augmented generation frameworks on our processed data, as well as several variants of embedding models. To assess our system's effectiveness, we generate synthetic question-answer pairs from our refined data to test the accuracy of different retrieval models. Furthermore, we create additional synthetic question-answer pairs specifically targeting the multimodal table and chart information extracted from the slides. Our findings indicate that utilizing fine-tuned embedding models and language models with the original retrieval-augmented generation framework achieves the highest accuracy. We also fine-tune Vision Large Language Models to determine whether open-sourcing our data processing pipeline is feasible. We conclude by outlining next steps to encourage research toward developing open-source retrieval-augmented generation frameworks for low-resource domains.

Table of Contents

1 Introduction
1.1 Background and Motivation
1.2 Research Objective
1.3 Thesis Statement

2 Background
2.1 Background and trends in NLP and information systems
2.2 Exploring Retrieval-Augmented Generation
2.3 Embedding Models
2.4 Fine-tuning
2.5 Multimodal Data Processing in RAG
2.5.1 Synthetic QA and Data Generation

3 Approach
3.1 Data Processing
3.1.1 Dataset
3.1.2 Slide Data Processing
3.1.3 Additional Headers
3.1.4 Synthetic QA Generation
3.1.5 Evaluation of Synthetic Data
3.2 Embedding Model - BGE-M3
3.2.1 Fine-Tuning
3.3 RAG Frameworks
3.3.1 Frameworks
3.3.2 Storage of Data
3.3.3 Evaluation of Retrieval
3.4 Open Source VLLMs for OCR Processing
3.4.1 Fine-tuning VLLM
3.4.2 Data Preparation
3.4.3 Evaluation of VLLM

4 Experiments
4.1 Evaluation of RAG and Embedding Models
4.2 Evaluation of Table and Chart Data
4.3 Evaluation of VLLM

5 Analysis
5.1 Synthetic Data
5.1.1 Qualitative Analysis
5.1.2 Quantitative Analysis
5.2 RAG Frameworks & Retrieval Analysis
5.2.1 Embedding Models
5.2.2 RAG Frameworks
5.3 Qwen 2.5 VL Fine-tuning Analysis
5.3.1 Image Dimensions
5.3.2 Base Model vs. Finetuned Model Data
5.3.3 LLM Judge Preference - Qualitative Analysis

6 Conclusion
6.0.1 Future Directions
6.0.2 Conclusion

A Appendix
A.0.1 Converting image-to-text prompt:
A.0.2 Correction of results:

Bibliography

About this Honors Thesis

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
School
Department
Degree
Submission
Language
  • English
Research Field
Keywords
Committee Chair / Thesis Advisor
Committee Members
Last modified