Generalizability of automated scores of writing quality in grades 3-5

Joshua Wilson, Dandan Chen, Micheal P. Sandbank, Michael Hebert

Research output: Contribution to journal › Article

Abstract

The present study examined issues pertaining to the reliability of writing assessment in the elementary grades, and among samples of struggling and nonstruggling writers. The present study also extended nascent research on the reliability and the practical applications of automated essay scoring (AES) systems in Response to Intervention frameworks aimed at preventing and remediating writing difficulties (RTI-W). Students in Grade 3 (n = 185), Grade 4 (n = 192), and Grade 5 (n = 193) responded to six writing prompts, two prompts each in the three genres emphasized in the Common Core and similar "Next Generation" academic standards: narrative, informative, and persuasive. Prompts were scored using an AES system called Project Essay Grade (PEG). Generalizability theory was used to examine the following sources of variation in PEG's quality scores: prompts, genres, and the interaction among those facets and the object of measurement: students. Separate generalizability and decision studies were conducted for each grade level and for subsamples of nonstruggling and struggling writers identified using a composite measure of writing skill. Low-stakes decisions (reliability ≥ .80) could be made by averaging scores from a single prompt per genre (i.e., 3 total) or 2 prompts per genre if administered to struggling writers (i.e., 6 total). High-stakes decisions (reliability ≥ .90) could be made by averaging across two prompts per genre (6 total) or 4-5 prompts per genre if administered to struggling writers (12-15 total). Implications for use of AES within RTI-W and the construct validity of AES writing quality scores are discussed.
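For context on how the decision (D) studies summarized above translate variance components into numbers of prompts, the following is a minimal sketch assuming a design in which students (p) are crossed with prompts nested within genres (i:g); the article itself reports the actual design and estimated variance components. With n'_g genres and n'_i prompts per genre in the D study, the relative error variance and generalizability coefficient are:

\[
\sigma^{2}_{\delta} \;=\; \frac{\sigma^{2}_{pg}}{n'_{g}} \;+\; \frac{\sigma^{2}_{p(i:g)}}{n'_{g}\, n'_{i}},
\qquad
E\rho^{2} \;=\; \frac{\sigma^{2}_{p}}{\sigma^{2}_{p} + \sigma^{2}_{\delta}}.
\]

Averaging over more prompts per genre (a larger n'_i) shrinks the error variance, which is how the D studies arrive at the prompt counts required to reach the .80 and .90 reliability thresholds for the full samples and for the struggling-writer subsamples.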

Original language: English (US)
Pages (from-to): 619-640
Number of pages: 22
Journal: Journal of Educational Psychology
Volume: 111
Issue number: 4
ISSN: 0022-0663
DOI: 10.1037/edu0000311
Publisher: American Psychological Association Inc.
State: Published - May 2019

Keywords

  • Assessment
  • Automated essay scoring
  • Generalizability
  • Struggling writers
  • Writing

ASJC Scopus subject areas

  • Education
  • Developmental and Educational Psychology

Cite this

Wilson, J., Chen, D., Sandbank, M. P., & Hebert, M. (2019). Generalizability of automated scores of writing quality in grades 3-5. Journal of Educational Psychology, 111(4), 619-640. https://doi.org/10.1037/edu0000311
