Abstract
Large language models (LLMs) have revolutionized the question answering (QA) domain by achieving near-human performance across a broad range of tasks. Recent studies suggest that LLMs can answer clinical questions and provide medical advice. Although safety is essential in LLM answers, existing evaluations of medical QA systems often focus only on the accuracy of the content. A critical, underexplored question is whether variations in patient inquiries -- rephrasings of the same question -- lead to inconsistent or unsafe LLM responses. We propose a new evaluation methodology that leverages synthetic question generation to rigorously assess the safety of LLMs in patient-facing medical QA. Benchmarking eight LLMs, we observe a weak correlation between standard automated quality metrics and human evaluations, underscoring the need for enhanced sensitivity analysis when evaluating the safety of patient-facing medical QA.
Table of Contents
Introduction
Related Work
Methodology
Evaluation
Results
Discussion
About this Honors Thesis
Rights statement
- Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.