Go figure, one MIT prof thinks he can game one engine and the New York Times and NPR do stories on his anecdotal claims and miss the evidence of many large scientific studies including the study released this month.

Students coming up the writing learning curve don’t know how to put together the right words in good coherent essays. Automated scoring systems help them do that. Highly educated people can take a good essay and make it worse by changing a few words. However, it doesn’t hide the fact that the person still knows the material. It is much harder for a student to take their poor essay and make it look good. The best way to do that is to learn the material. The best way to learn to write is to practice and automated essay scoring helps teachers demand more writing.

Automated scoring engines are widely deployed in professional certification, SAT testing, and state testing programs, and are used in thousands of classrooms. The NYTimes article glosses over the fact that these systems are operational and that there are plenty of mechanisms for human oversight, including second human graders, backreads, and teachers reading the essays. The concerns raised may be the same concerns that apply when students write essays for humans.

This year I’ve had the chance to speak with executives at nine assessment organizations on a weekly basis. Peter Foltz, Pearson Knowledge Technologies, noted this week that, “There are also ways to have the computer detect a number of problems of people trying to game the system. We have mechanisms that detect plagiarism, unusually creative, off-topic, larding large words, unusual wording. So many of these issues are already addressed in existing systems.”

Pearson will score about 10 million constructed responses this year and currently has about 24,000 teachers using its software in classrooms. Foltz said, “We don’t have teachers complaining about students gaming the system. The teachers are still reading the student essays, but they don’t have to monitor every draft. Instead, we hear about students learning to write and think.” Here are a few example uses of automated essay scoring:

  • Starting in 2010, South Dakota replaced its 45-minute paper-and-pencil summative writing assessment with WriteToLearn, used as a formative assessment. All students in 5th, 7th and 10th grade are required to write using the software at least three times during the school year. Teachers decide which prompts are assigned to students and how the writing is incorporated into their lesson plans. A study conducted last year and presented at the National Council on Measurement in Education conference in 2011 reported data from 255,000 student submissions. Students wrote an average of 3.5 drafts on each prompt. With five revisions, student scores increased by almost one point (on a six-point scale). Teachers receive immediate reports on individual students as well as the class and can use the results to work with students on areas needing improvement. With the old paper summative test, it took 3-6 weeks for teachers to receive any feedback. (Foltz, P. W., Lochbaum, K. E., & Rosenstein, M. B. (2011). Analysis of student writing for a large scale implementation of formative assessment.)
  • East Palo Alto Tennis and Tutoring (EPATT) used WriteToLearn in an after-school tutoring program for struggling students. One focus was teaching summarization skills: students read a text and then wrote a summary that was automatically scored for the quality of its content. Kesha Weekes, EPATT academic director, said that previously students would focus on the mechanics of writing, looking for mistakes in spelling, punctuation and subject-verb agreement, but they didn’t know how to focus on writing to show they comprehend the content. “WriteToLearn shows them where they’ve omitted an area of content, so they’re taking a more critical look at their writing.” “They would never have done that before.” (Case study at: http://www.writetolearn.net/CaseStudies/EPATT-WriteToLearn-082007.pdf)
  • WriteToLearn is being used at Cedar Shoals High School in Athens, Georgia as a way of helping ELL students master the English language. Students get feedback on six writing traits (ideas, organization, conventions, sentence fluency, word choice and voice) and can focus on several writing concepts at one time, as opposed to learning one concept now and then another later. For ELL writers who come to writing in English with highly varied skills, it provides a more personalized way of learning where their strengths lie and how to improve in the other areas. (Case study at: http://www.writetolearn.net/CaseStudies/WTL-CedarShoalsCaseStudy-061209.pdf)

The Hewlett Foundation-funded Automated Student Assessment Prize (ASAP) was designed to create evidence that supports the use of automated essay scoring, so that states can affordably incorporate more writing on their annual tests and avoid tests limited to multiple choice items. There will be additional field trials, but ASAP has provided sufficient evidence to support widespread use of automated scoring. States will be encouraged to be thoughtful about ways to combine expert grading and automated scoring to support high-stakes decision making.

A foundational benefit of the shift to personal digital learning that will occur in this decade worldwide is formative assessment that runs in the background all day long providing periodic structured feedback to every student. It will enable customized learning pathways.  It will empower better teaching.  It will boost motivation, persistence, and the quality of student work products.  It will extend achievement and degree completion. That’s the real story.

Disclosures: ASAP is a project of Open Education Solutions, where Tom is CEO. Pearson is a limited partner in Learn Capital, where Tom is a partner.


  1. Where to begin? The Shermis study, as a “scientific” study, was shite, so loaded with weasel phrases like “the software by and large replicated human reading performance” that no actual precise claims were asserted as having been proven, yet the overall impression created was that they had. Look at the actual numbers and “by and large” means things like .85 agreement for human readers and .65 for the software, remembering that .5 agreement equals a coin-toss. That’s not “by and large replicating,” really. Look also at the elaborate machinations required by the machines: most texts scored were 200 words or less. *4 weeks* and *60%* of the total 22,000-paper sample were required to train the machines to begin with. This writing has nothing to do with what students actually need to be writing in college. Clear evidence that the software founders with texts of length greater than 500 words, and that the best predictor of score on shorter pieces is … length. Doesn’t even matter if, to get that length, you just repeat a paragraph already written. Numerous demonstrations show that the software doesn’t notice or care.

    Look at the machines’ utter incompetence with anything they’re not actually programmed to handle: any initiative from students, any deviation from the 5-paragraph format, any novelty, any creativity. (ALL writing is creative, and it’s terribly unfortunate that people like you who don’t realize that have any say in writing instruction and assessment whatsoever, but it accounts for why you can with a straight face claim machine-scoring is adequate.)

    What you quote Foltz as saying the computer can detect in attempts at gaming, like “unusual creativity,” it “detects” by spitting the piece out as “unscorable,” which is, “this didn’t fit my incredibly narrowly programmed parameters so I can’t even tell what it is.” That’s a true success there, Tom. Yet you claim this is superior to human reading and interaction with a text.

    What you refer to as Perelman anecdotally beating one machine has been demonstrated by many researchers (see, e.g., several studies in Ericsson and Haswell, 2006) on each of the three major AES software packages. You guys just figure if you keep ignoring it, the press will keep writing adoring headlines based on polished-turd studies like Shermis’s. So far, it’s working. Well played.

    But please spare us your indignation when we call bullshit and pull back the curtain on the software to show that, sure, it can score 150 words on a 3 point integer scale just like a human would, if you give it a month of training on 13,000 samples — but that in any given text, it can’t actually tell the difference between nonsense and content. Proven, proven, proven, time and again. You might try actually responding to that concern by admitting it’s the case, and then explaining why we shouldn’t care.

