“Moneyballing” Grades: Using Sabermetrics to Improve Feedback

I’m a big baseball fan, so I always look forward to February, when pitchers and catchers report to camp to start working for the upcoming season before playing about a month’s worth of spring training games in small, fan-friendly parks. For me, a large part of the draw of baseball lies in the statistics and numbers on which the game is built, and I still have fond memories of poring over box scores in the newspaper as a kid.
This year, I’m particularly excited about the season, thanks to the Sabermetrics 101 course I took with edX. “Sabermetrics” is the statistical study of the game of baseball, as championed by Bill James and others. In short, Sabermetrics argues that we can learn more from the scientific study of the data the game produces than we traditionally do through anecdote and simple observation. Using baseball data, Sabermetricians have asked a number of interesting questions that challenge conventional wisdom, including whether bunts are useful and where we should bat good-hitting pitchers in a lineup. The approach has since been popularized by Michael Lewis’ book Moneyball, which served as the source for the 2011 film of the same name.
Based on research into “rate stats,” or statistics expressed as percentages (e.g., home runs allowed per 9 innings, on-base percentage, etc.), Sabermetricians have begun asking how long it takes a given stat to stabilize, that is, to reach a reasonable level of confidence that it reflects a player’s true talent (in other words, when does a sample size stop being too small?). Stats vary widely in the sample size needed to reach this stability. For example, strikeouts per plate appearance (K/PA) stabilizes after around 60 plate appearances, but batting average (BA) takes around 1,000. For more details on how these studies are done, including the stabilization results of other rate stats, see Russell Carleton’s seminal article.
We know that a player may get hot early in the season, posting a .400 batting average through May only to cool down later in the year. Likewise, some players heat up late in the season, far outpacing their earlier performance. Statistically, this well-known effect, whereby extreme measurements tend to move back toward the average on subsequent measurements, is called “regression to the mean,” and it is observed commonly far beyond baseball: exceptionally tall parents, for example, tend to have children shorter than themselves.
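A quick simulation makes the point, with a hedge: the .270 true talent, 100 at-bats per month, and the .350 “hot month” cutoff are made-up numbers, chosen so the toy model produces enough hot months to average.

```python
import random

random.seed(7)
TRUE_TALENT = 0.270   # the hitter's actual skill, hidden from observers

next_month = []
for _ in range(10_000):
    # A month's batting average over 100 at-bats is talent plus noise.
    may = sum(random.random() < TRUE_TALENT for _ in range(100)) / 100
    if may >= 0.350:  # only look at the hitters who ran hot
        june = sum(random.random() < TRUE_TALENT for _ in range(100)) / 100
        next_month.append(june)

# The hot starts aren't sustained: the follow-up average sits near .270.
print(len(next_month), round(sum(next_month) / len(next_month), 3))
```

Nothing about the player changed between months; selecting on an extreme sample guarantees the next sample looks closer to average.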
As educators, we are quite familiar with rate statistics in the form of grades, which are, after all, intended to be performance feedback in the form of percentages that are customarily translated into letters for the sake of ease. Grades are often determined using a linear weight formula similar to slugging percentage, with the final number produced by a weighted sum of participation, homework, quizzes, projects, etc. Quizzes may be weighted twice as much as homework but only half that of tests, for instance.
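As a sketch, such a grade is just a weighted average; the category names and weights below are invented for illustration, not a prescribed scheme:

```python
def weighted_grade(scores, weights):
    """scores: category -> percentage (0-100); weights: relative weights."""
    total_weight = sum(weights[cat] for cat in scores)
    return sum(scores[cat] * weights[cat] for cat in scores) / total_weight

# Quizzes count double homework; tests count double quizzes.
weights = {"participation": 1, "homework": 1, "quizzes": 2, "tests": 4}
scores = {"participation": 100, "homework": 90, "quizzes": 85, "tests": 80}
print(weighted_grade(scores, weights))  # → 85.0
```

Like slugging percentage, the formula collapses several kinds of performance into one number by deciding up front how much each kind is worth.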
With this in mind, I can’t help but apply Russell Carleton’s idea to the study of grades. Do grades find stability after some number of measurements (or, when is a sample size no longer small)? In other words, if Buster is getting a 94% in Latin after 10 individual assessments (i.e., graded homework, quizzes, projects, etc.), will he still be at a 94% after 20, 40, or even more? Basically, I wonder…

How many graded assignments do we need to give students, before we can be reasonably sure that the grade produced by the assessments reflects the “true talent” of the student?

In my experience, it takes far fewer individual assessments than commonly thought to produce a stable grade: I see my students’ grades reach a stable number in as few as 10-15 assessments, with numbers leveling off by the middle of the second quarter. With few exceptions, deviations from this number (e.g., a bad quiz score) are followed by regression to the mean. If grade percentages do stabilize this quickly, we need to ask whether our current grading models are as effective a feedback metric as we assume them to be.
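One hypothetical way to make “stable” concrete: find the earliest point after which the running grade average never strays more than some tolerance from the final grade. The scores and tolerance below are invented for illustration.

```python
def stabilization_point(scores, tol=3.0):
    """Earliest assessment count after which the running average
    stays within `tol` points of the final average."""
    running = [sum(scores[:i + 1]) / (i + 1) for i in range(len(scores))]
    final = running[-1]
    for i, _ in enumerate(running):
        if all(abs(r - final) <= tol for r in running[i:]):
            return i + 1
    return len(scores)

# One bad quiz (the 72) pulls the average down early, but the grade
# settles within a few points of its final value soon after.
quiz_scores = [88, 95, 90, 72, 94, 91, 93, 89, 96, 92, 90, 94]
print(stabilization_point(quiz_scores))  # → 5
```

By this (admittedly crude) definition, everything after the stabilization point tells us little we didn’t already know.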
Part of my motivation in thinking about this question comes from colleagues who are constantly burdened by grading and spend their evenings and weekends getting it done, determined to put a grade on each and every bit of work students do under the belief that students won’t learn if they’re not graded. But if we’ve reached that magic number at which an overall grade has become firmly anchored, why do we as both teachers and students spend so much time on this redundant drudgery?

If Sabermetrics sheds any light on grades and a grade average won’t be substantially different whether we measure it with 10 or 40 assessments, why do we put ourselves through all this work?

How might we thus transform how we assess students in ways that help them grow as learners?
Because of my own dissatisfaction with the information grades were giving me, my approach to grading has evolved over the past few years. My current grading schema includes homework (i.e., pop “quiz” assessments), grammar quizzes, vocab quizzes, and projects, with all four categories weighted equally. I no longer grade participation (on which cf. my thoughts here), and I don’t grade homework itself. Instead, I use tools like Socrative, Kahoot, or simple pen-and-paper in place of homework to assess what students learned in doing their work. I have also begun allowing my students to retake anything I’ve assigned a grade to, as many times as they’d like, including the homework assessments and all quizzes (with different versions each time). Retake scores then replace the originals in the average, so lower scores aren’t counted. With fewer but more meaningful grades, I’ve observed students focusing more of their efforts on learning, and I’ve seen a sharp increase in the quality of their efforts and improved self-efficacy (for the importance of self-efficacy in learning, see McGonigal’s book SuperBetter).
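A sketch of the “lower scores aren’t counted” retake policy (the item names and scores are invented): keep the best attempt on each graded item, then average across items.

```python
def grade_with_retakes(attempts):
    """attempts: item name -> list of scores across retakes.
    Only the best attempt on each item counts toward the average."""
    best = {item: max(scores) for item, scores in attempts.items()}
    return sum(best.values()) / len(best)

attempts = {
    "grammar quiz 1": [72, 88],   # retaken once; the 72 is dropped
    "vocab quiz 1": [95],
    "project 1": [84, 90, 93],    # retaken twice; only the 93 counts
}
print(grade_with_retakes(attempts))  # → 92.0
```

The design choice is that each grade records what the student eventually learned, not how they performed on a particular day.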

Giving students multiple chances to show what they actually know on quizzes has not only helped our students learn the material better, but I think it has also improved their confidence in themselves and their ability to develop a mastery of our course material.

Instead of moving through the course grade by grade, we have paused to focus on each assessment individually, creating opportunities for students to learn the material as well as they want before moving forward. In this way, students have reported that they feel more confident in their command of our course content, while they also feel less stress from having to study for quiz after quiz and test after test. The entire grading process, not just the grades themselves, is now a more meaningful feedback system.
In exploring the relationship between grades and learning in my own classes, I am happy with the progress I’ve made in making my grading system more meaningful for students, but I also believe there’s room for improvement. Just as some baseball measurements, like Wins Above Replacement (WAR), which aims to capture a player’s total contribution to his team, are held in much higher regard than others for the value they reflect, I wonder whether there’s such a thing as an ideal grading model that allows for more insight into student learning. I am eager to continue the analysis, starting with some questions for further exploration:

  • Do grades behave like many baseball rate statistics and stabilize over time?
  • If so, when do they stabilize (i.e., after 10, 20, 30, or more individual grades)?
  • How much variation is there between different teachers’ grading models?
  • What does an ideal grading model include and what does it neglect?
  • Assuming grades do behave this way, how might we design our grading schema around assessment quality, rather than quantity?

In education, as in baseball, a student’s progress and evaluation should be measured by more than a single number. But we are bound by a number of policies and conventions that force us to continue to create grades for our students that are too often used as flawed yardsticks for learning, college evaluation, and so much more. Until the system evolves to diminish the role that grades play, we’re forced to do the best we can with grades. Perhaps baseball can teach us a few new things about what rate statistics like grades actually mean and offer a few new paths to help make them more valuable. See you in Scottsdale (or Florida)!
I do quite a lot of my own professional learning through MOOCs, and, thanks to the variety of interesting courses they offer, edX has proven to be one of my favorite platforms. In addition to the SABR101 course on Sabermetrics discussed above, recent courses on Electronic Literature, The Rise of Superheroes, and Networks, Crowds, and Markets have all influenced project work I do with students, as well as my own thinking on curriculum design.


