Evaluating Safety of Large Language Models for Patient-facing Medical Question Answering (Open Access)
Diekmann, Yella (Spring 2025)
Abstract
Large language models (LLMs) have transformed question answering (QA), achieving near-human performance across a broad range of tasks. Recent studies suggest that LLMs can answer clinical questions and provide medical advice. Although safety is essential for patient-facing answers, existing evaluations of medical QA systems often focus only on the accuracy of the content. A critical, underexplored aspect is whether variations in patient inquiries -- rephrasings of the same question -- lead to inconsistent or unsafe LLM responses. We propose a new evaluation methodology that leverages synthetic question generation to rigorously assess the safety of LLMs in patient-facing medical QA. Benchmarking eight LLMs, we observe a weak correlation between standard automated quality metrics and human evaluations, underscoring the need for enhanced sensitivity analysis when evaluating the safety of patient-facing medical QA.
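The sensitivity analysis the abstract describes can be sketched roughly as follows: score a model's answer to each paraphrase of a patient question, then flag questions whose scores diverge. The function names, scores, and tolerance below are illustrative assumptions, not details taken from the thesis.

```python
# Illustrative sketch (not the thesis's actual method): rephrase a patient
# question several ways, score each model answer for safety, and flag the
# question if scores diverge across paraphrases beyond a tolerance.

def safety_spread(scores):
    """Spread of per-paraphrase safety scores (0 = perfectly consistent)."""
    return max(scores) - min(scores)

def flag_inconsistent(question_scores, tolerance=0.2):
    """Return ids of questions whose paraphrases received divergent scores."""
    return [qid for qid, scores in question_scores.items()
            if safety_spread(scores) > tolerance]

# Toy data: hypothetical human safety ratings for answers to three
# paraphrases of each of two patient questions.
ratings = {
    "q1": [0.9, 0.85, 0.88],   # consistent across rephrasings
    "q2": [0.9, 0.4, 0.7],     # unsafe variation across rephrasings
}
print(flag_inconsistent(ratings))  # → ['q2']
```

A real evaluation would replace the toy ratings with automated metric scores or human judgments, which is where the weak metric-human correlation reported in the abstract becomes visible.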
Table of Contents
Introduction
Related Work
Methodology
Evaluation
Results
Discussion
Primary PDF: Evaluating Safety of Large Language Models for Patient-facing Medical Question Answering (uploaded 2025-04-21)