Evaluating the Quality of Ratings in Writing Assessment: Rater Agreement, Error, and Accuracy

Open Access

Wind, Stefanie Anne (2012)

Permanent URL: https://etd.library.emory.edu/concern/etds/hd76s0919?locale=en

Published

Abstract

The purpose of this study is to examine the congruence among methods used to
evaluate the quality of ratings obtained in large-scale performance assessments. Within
the context of a large-scale writing assessment, this study focuses on the alignment
between operationally used indices of rater agreement, error and systematic bias, and
direct measures of accuracy within a traditional and Rasch-based approach. This study
uses 365 essays from the Georgia High School Writing Test that were rated by 20
operational raters and by a committee of "expert raters," whose scores were used to
compute direct accuracy measures. The Facets computer program (Linacre, 2010) is used
to compute all of the indices of rating quality. Major empirical findings suggest that
Rasch-based indices of model-data fit for ratings as well as indices of rater agreement
from Facets (Linacre, 2010) provide information about raters that is comparable to direct
measures of accuracy. Because direct measures of rater accuracy are often not attainable
in operational settings, the use of easily obtained approximations of direct accuracy
measures holds significant implications for monitoring rating quality in large-scale
rater-mediated performance assessments.
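
As an illustration only, the sketch below shows one simple way that a rater agreement index and a direct accuracy measure could be computed from raw scores. It is not the Facets procedure used in the thesis. The function names, the exact-match definition of both agreement and accuracy, and the toy scores are all assumptions made for this example.

```python
from itertools import combinations


def exact_agreement(scores_a, scores_b):
    """Return the proportion of essays on which two score lists match exactly."""
    if len(scores_a) != len(scores_b) or len(scores_a) == 0:
        raise ValueError("score lists must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(scores_a, scores_b) if a == b)
    return matches / len(scores_a)


def pairwise_agreement(ratings):
    """Exact agreement for every pair of operational raters (hypothetical helper)."""
    return {
        (r1, r2): exact_agreement(ratings[r1], ratings[r2])
        for r1, r2 in combinations(sorted(ratings), 2)
    }


def rater_accuracy(ratings, expert_scores):
    """Proportion of each rater's scores that match the expert committee's scores.

    Treating an exact match as 'accurate' is an assumption of this sketch.
    """
    return {
        rater: exact_agreement(scores, expert_scores)
        for rater, scores in ratings.items()
    }


# Toy data (hypothetical): five essays, three operational raters, one expert score per essay.
expert_scores = [3, 2, 4, 1, 3]
ratings = {
    "R1": [3, 2, 4, 2, 3],
    "R2": [3, 3, 4, 1, 2],
    "R3": [2, 2, 4, 1, 3],
}

agreement = pairwise_agreement(ratings)
accuracy = rater_accuracy(ratings, expert_scores)

print(agreement)
print(accuracy)
```

Comparing the two dictionaries for the same raters mirrors, in miniature, the question the abstract raises: whether indices available without expert scores (agreement) carry information similar to direct accuracy measures.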

Bachelor of Arts
Bachelor of Music
Advisor: George Engelhard, Jr.
A thesis submitted to the Faculty of the
James T. Laney School of Graduate Studies of Emory University in partial fulfillment of
the requirements for the degree of
Master of Arts
in Educational Studies: Quantitative Methodology
2012
September 12, 2011

Table of Contents

Theoretical Framework
Invariant Measurement
Rasch Measurement Theory
Brunswik's Lens Model
Significance of the Study
Purpose of the Study
Research Questions
Definitions
Review of the Literature
Rater-Mediated Writing Assessment
Indices of Rating Quality in Writing Assessment
Rater Agreement
Selection of an Agreement Coefficient for Rater-Mediated Writing Assessments
Rater Error and Systematic Bias
Rater Error and Systematic Bias within a Traditional Approach
Rater Error and Systematic Bias within a Rasch-based Approach
Interpreting Rater Error and Systematic Bias in Context
Rater Accuracy
Accuracy Measures within a Traditional Approach
Accuracy Measures within a Rasch-based Approach
Interpreting Rater Accuracy in Context
Methodology
Instrument
Participants
Data Analysis
Selected Indices within a Traditional Approach
Selected Indices within a Rasch-based Approach
Results
Summary of Statistical and Psychometric Measures
Calibration of Ratings
Domain Calibrations for Ratings
Rating Scale Category Use
Calibration of Accuracy Scores
Domain Calibrations for Accuracy Ratings
Rating Quality Analyses
Rater Agreement
Rater Error and Systematic Bias
Rater Accuracy
Comparison of Rating Quality Indices
Discussion
Limitations and Delimitations
Conclusions in terms of Research Questions
Research Question One
Research Question Two
Implications
Theory
Research
Policy and Practice
References

List of Tables and Figures

Table 1. Instrument Description
Table 2. Categories of Rating Quality Indices
Table 3. Summary Statistics for Ratings
Table 4. Indices of Rater Error and Systematic Bias
Table 5. Calibration of the Domain Facet
Table 6. Rating Scale Structure
Table 7. Summary Statistics for Accuracy Ratings
Table 8. Indices of Rater Accuracy
Table 9. Calibration of Accuracy Ratings within Domains
Table 10. Correlations Among Traditional Indices
Table 11. Correlations Among Rasch-based Indices

Figure 1. Brunswik's Lens Model for Probabilistic Functionalism
Figure 2. Variable Map for Rating Data
Figure 3. Variable Map for Accuracy Data
Figure 4. Scatter Plots for Traditional Rating Quality Indices
Figure 5. Scatter Plots for Rasch-based Rating Quality Indices

Appendix A
IRB Documentation

About this Master's Thesis

Rights statement

Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.

School	Laney Graduate School
Department	Educational Studies
Degree	MA
Submission	Master's Thesis
Language	English
Research Field	Education, Tests and Measurements
Keyword	Rasch Measurement; Writing assessment; Raters; Rater error; Rater agreement; Rater accuracy
Committee Chair / Thesis Advisor	Engelhard, George, Emory University
Committee Members	Jensen, Robert J, Emory University; Cheong, Yuk Fai, Emory University

Primary PDF

Title	Date Uploaded
Evaluating the Quality of Ratings in Writing Assessment: Rater Agreement, Error, and Accuracy	2018-08-28 16:10:27 -0400
