Automated Essay Scoring in a High Stakes Environment

Dr. Mark Shermis

One of our goals here is to begin to create a data-abundance mindset in U.S. K-12–prepping for policies and practices informed by big data surrounding anywhere, anytime learning.  To that end, we like to highlight interesting projects and proposals–and we have a good one today.
Mark Shermis, Dean of Education at the University of Akron, is an academic advisor to an assessment innovation project that we’re working on and one of the most knowledgeable folks we’ve found when it comes to automated scoring. He contributed a chapter to an academic book called Innovative Assessment for the 21st Century: Supporting Educational Needs.  He defines automated essay scoring (AES) as the evaluation of written work via computer-based analysis.
Dr. Shermis laments the fact that US secondary students receive an average of three assessments per semester in a writing class.  We’d both like to see students writing every day and getting instant feedback.
Here’s Mark’s proposal:

Rather than administer a high-stakes writing test in the spring of each year, administer about 15 AES-scored essays throughout the year (it could be more).  The electronic portfolio could monitor student progress from the beginning of the year through to the end.  Toward the end of the year, average the scores for the last three writing assignments and use that average as the accountability measure for the domain of writing.  To keep the process secure, the topics for the last three prompts can be controlled by the state department of education and released on a strict schedule.
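The averaging step in the proposal is simple enough to sketch in a few lines. The function name, the 1–6 rubric scale, and the sample scores below are illustrative assumptions, not part of any state system or vendor API:

```python
def accountability_score(portfolio_scores, window=3):
    """Average the final `window` essay scores in a student's portfolio.

    `portfolio_scores` is a chronological list of AES scores for the
    year (here assumed to be on a 1-6 rubric scale).
    """
    if len(portfolio_scores) < window:
        raise ValueError(f"Need at least {window} scored essays")
    final_scores = portfolio_scores[-window:]
    return sum(final_scores) / window

# A year of ~15 scored essays; only the last three (5, 4, 5)
# feed the accountability measure.
year = [2, 3, 3, 2, 4, 3, 4, 4, 5, 4, 4, 5, 5, 4, 5]
print(accountability_score(year))
```

Averaging the final three prompts, rather than using a single sitting, smooths out a one-off bad day while still resting on state-controlled, securely released topics.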

Whether you like the policy proposal or not, AES is beginning to encourage a lot more writing with frequent entries into an electronic gradebook and detailed feedback (e.g., Pearson’s Write to Learn, CTB’s Writing Roadmap).*
In many cases, AES is as accurate as human grading.  The Common Core demands of writing to text (e.g., narrative, expository, descriptive) will challenge the current generation of scoring engines.  An upcoming demonstration will outline the contours of current capabilities.
Dr. Shermis points to a number of benefits of his proposal:

  • Integration, not competition, with instruction–assessment that informs instruction
  • Instant feedback for students
  • A reliable picture of student abilities and a big trail of evidence
  • Realistic expectations in a no-surprises environment

We could add security to the list.  It will be harder to tamper with AES results than with paper-and-pencil exams.  With a big trail of evidence from weekly entries, a big change in score (up or down) on a high-stakes exam would be easy to spot.
The benefits of instructional tech that can double as a test are numerous.  It’s great to have academics like Shermis pushing the boundaries of technology and policy and helping to create the big data future of learning.
 * Disclosure: Pearson is an investor in Learn Capital where Tom is a partner

Tom Vander Ark

Tom Vander Ark is the CEO of Getting Smart. He has written or co-authored more than 50 books and papers including Getting Smart, Smart Cities, Smart Parents, Better Together, The Power of Place and Difference Making. He served as a public school superintendent and the first Executive Director of Education for the Bill & Melinda Gates Foundation.


Randy Bennett

Tom, it's an intriguing idea, but when it comes to high-stakes assessments, it's probably not as simple as the posting makes it sound. For discussion of some of the issues, see the Gates Foundation-funded report, "Automated Scoring of Constructed-Response Literacy and Mathematics Items," at


Tom Vander Ark

Thanks Randy. Posted summary of the report here
Your advice seems appropriate for short term replacement/augmentation of human scoring but I'm curious about where you think AI scoring can push boundaries.
What role will innovative items/activities play (sims, games, experiments)?

Randy Bennett

Tom, I wouldn't call it near-term advice, at least not in the context of tests to be used for high-stakes decisions. Regardless of what form those tests take, traditional or not (e.g., simulations, games, experiments), the way the scores are generated matters and evidence to support the validity and fairness of those scores is essential. In that respect, the recommendations below apply in the near-term, and I would argue, in the long-term too.
1. Design a computer-based assessment as an integrated system in which automated scoring is one in a series of interrelated parts
2. Encourage vendors to base the development of automated scoring approaches on construct understanding
3. Strengthen operational human scoring
4. Where automated systems are modeled on human scoring, or where agreement with human scores is used as a primary validity criterion, fund studies to better understand the bases upon which humans assign scores
5. Stipulate as a contractual requirement the disclosure by vendors of those automated scoring approaches being considered for operational use
6. Require a broad base of validity evidence similar to that needed to evaluate score meaning for any assessment
7. Unless the validity evidence assembled in Number 6 justifies the sole use of automated scoring, keep well-supervised human raters in the loop

Randy Bennett

Stated another way, games, simulations, and experiments may allow us to measure traditional competencies more effectively and to measure competencies we couldn't measure before. Automated scoring will play a key part in helping with that measurement. But evidence to support the validity and fairness of results, and processes to ensure quality, will continue to be necessary--as long as we use those results to make consequential decisions about individuals and institutions.
