Automated Essay Scoring: Wave of the Future?

I’ll just come right out and say it: I am fundamentally opposed to automated essay scoring. For all the reasons I’ve mentioned throughout this blog, and throughout this course, I think that AES runs counter to every part of my pedagogical stance. We’ve discussed assessment as technology, and certainly AES is a technology, and I understand its merits. Sure, there are situations that do not involve classroom assessment. There are times when large-scale, high-stakes tests call for quick, efficient, reliable scoring of massive numbers of essays. But my question is, should there be?

AES does not treat composition as a process, but instead as a product. Machines code essays for criteria that have nothing to do with ideas or expression, but instead focus on whatever quantifiable aspects of writing there are. Here, we say to our students, jump through this hoop and do x, y, and z, and you’ll pass the test. Assessments like this demonstrate a complete break between taught and tested curriculum, between pedagogy and assessment practices. I see the merit of AES as efficient, quick, and technologically motivated, but I ask—again, mirroring assertions by Wardle and Roozen, Broad, Yancey, Reilly and Atkins, and others—isn’t reliability here at risk of overturning validity? AES measures what it’s supposed to measure, sure. But, is what it’s measuring really what we want to measure? Our pedagogy and the current moment in composition theory would indicate that, no, AES does not measure process, diversity, exploration, or reflection. Instead, it makes our students into algorithms, into nodes on a network, into mindless drones churning out five-paragraph essays that fit the formula but hold no pedagogical value whatsoever. There are ways to conduct large-scale assessments, but when we let the machines do the scoring, how far are we from letting the machines do the writing as well?


Evaluating (Digital) Writing and Programs

Here, I’ll expand on and adapt my response to Reilly and Atkins’ “Rewarding Risk” from Digital Writing Assessment and Evaluation. (See the original post here.) This chapter builds on works on multimodal assessment, but adds a level of “risk” and risk-taking for students and “aspirational” assessment for instructors. The authors use “deliberate practice” as a model for creating open-ended assignment criteria that encourage individual and collaborative exploration. This model has as its goals encouraging exploration and aligning assessment practices with pedagogical practices.

Reilly and Atkins adapt Lee Odell and Susan Katz’s assertion that multimodal assessment should be “generalizable and generative.” To this, the authors add that multimodal assessment should also be “aspirational, prompting students to move past the skills they have already learned to bravely take on unfamiliar tasks and work with new tools and applications that may cause them to re-vision their composing practices.” So, they’re not so much challenging Odell and Katz as adding to them; but the idea of generalizability seems to push back pretty hard against their assertion that assessments should be designed for individual assignments. They also call for an addition to Michael Neal’s four criteria for responding to “hypermedia,” which they argue are useful and productive, but do not address “how to encourage risk-taking and experimentation in conjunction with or through assessment processes.”

I’m intrigued by the way the authors propose a formative, rather than summative, approach to assessment. This means that composition students are taught about assessment and included in the creation of assessment criteria at the outset of an assignment. The thought here is that students will approach their assignments and take risks, knowing from the beginning what assessment model will be used while also knowing that they, themselves, had a hand in determining the assessment used.

I find Reilly’s guidelines for practicing aspirational assessment particularly interesting and, perhaps, helpful for future use in my own classroom:

• Allow time for play, exploration, and gains in proficiency prior to the discussion of assessment for a particular project.
• Look at (preferably externally identified) examples of excellent projects.
• Develop criteria in groups after reviewing the project description, client needs (if relevant), and the course student learning outcomes pertinent to the project.
• Allow student criteria to stand even if you, as the instructor, would have chosen other items on which to focus.
• Make room for peer review and revision time following the development of the assessment criteria.

I’m especially intrigued by the idea of “play, exploration, and gains in proficiency.” Here, I’ve always approached multimodal assignments a little backwards, by my own admission. I’ve resisted providing students with examples of exemplary work for fear that it might lead to unnecessarily narrowing the scope of what they could do. But, I’ve also found that this leads to confusion on expectations. I also like the idea of allowing criteria to stand even if I wouldn’t have chosen to focus on them myself. I’ve always been a fan of having students help me to create rubrics for assignments, but I’ve also generally steered the discussion so that the rubrics turn out more or less the way I want them to. Multimodal composition is not easy to assess because it is almost never apples to apples; instead it’s apples to oranges to kiwis to cars to shoes to houses. There’s nothing generalizable about it, but if multimodal assessment is contextually driven and built on these ideas of rewarding risk and exploration, I think that assessing multimodal assignments at the classroom level is quite doable.

Access, Accessibility, and Diversity

Here, I’ll focus on Arnetha Ball’s “Expanding the Dialogue” and Haswell and Haswell’s “Gender Bias.” But first, I’ll mention that this discussion leads me to an article we read for our pedagogy bootcamp: Wendy Bishop’s “Teaching Grammar for Writers in a Process Classroom.” To demonstrate my own process in dealing with questions of diversity, I’ll reprint part of my initial response to that article here:

Let me get this out of the way first: I swear by Strunk and White’s The Elements of Style. I will argue the merits of the Oxford comma with anyone who will listen, and I take grammar and style seriously… I just don’t see any evidence in Bishop’s discussion that the sorts of grammar and style elements that work in “Grammar B” writing aren’t also the same elements that work in “Grammar A” writing. I’m not a philistine, but there’s got to be a standard (stop me if I get preachy). We are here to help students think critically and find their voices, but couldn’t one argue that that’s exactly what they’re doing in a public speaking class next door in Diffenbaugh? What’s the difference here? We’re writing. We’re composing. (Read the full response here.)

I sound like a philistine, right? My opinions on grammar and style have changed dramatically now that I’ve actually spent two semesters in the classroom. These things which once seemed so important to me now take a backseat to composition as self-discovery. Arnetha Ball discusses the importance of diversity not only on the part of students, but also on the part of assessors and teachers, writing that, “It is time to include the voices of teachers from diverse backgrounds in discussions concerning writing assessment,” and that through their presence, these voices “can not only inform, but re-shape current assessment practices” (380). Same goes for Haswell and Haswell, who assert that, “Assessors, teachers, and students must question their own assumptions and practices, read with a new awareness, talk with each other, and be open to a situation that is too complex to accommodate universal claims” (429-29).

And these “universal claims” are what I think is at the heart of this discussion. Today’s writing classroom and today’s writers are diverse and, to go back again to Wardle and Roozen, they bring in their own ecological experiences when entering our classrooms or submitting to a writing test. We must not assess them on some rigid, right vs. wrong criteria, but instead must find ways to allow for a multitude of voices operating in their own contexts.

Assessment as Technology

I’d like to pick up a question asked (and nicely discussed) by both Jacob and David: Is assessment a technology? To discuss this question, I’d like to frame it around my own (admittedly problematic) definition of technology that I developed for Kathleen Yancey’s Digital Revolution and Convergence Culture course this semester: technology is any ephemeral tool that will one day be replaced by another ephemeral tool that claims to do the same job better. My argument may seem hard-headed, but I’ll put it this way: on a long enough timeline, the usefulness and lifespan of any technology is apt to end, replaced by something new and presumably better.

In discussing technology in terms of writing assessment, I’ll also point to Asao Inoue’s definition of technologies as “generative or productive.” So, if assessment practices—like new technologies—come and go in waves (as Yancey suggests in “Looking Back as We Look Forward”), then perhaps it is helpful to think of assessment as a technology. I think there’s a certain comfort in this idea: we’re not stuck with any one technology (if a computer doesn’t do what we need it to, we develop one that does), just like we’re not stuck with any one assessment. Looking back at the histories of assessment that we’ve read, it’s clear to me that assessment is a technology that is always in flux, responding to various exigences at various historical moments and always looking for the best, newest, most efficient, most valid and reliable way to address the problem of assessment. As I begin to think about computerized scoring and psychometrics as possible examples of one end of the technological spectrum, and Wardle and Roozen’s ecological model as the other, I can see a tension in the direction we’re headed. But it’s a process. And that’s a good thing.

Theories and Models of Assessment

Of all the readings this week, I am most drawn to Wardle and Roozen’s “Addressing the Complexity of Writing Development: Toward an Ecological Model of Assessment,” so that’s what I’ll focus on here. Wardle and Roozen use an ecological model of writing assessment to address Kathleen Blake Yancey’s call for a fourth wave of assessment practices that address the complexities of composition in the 21st century. They call for “a perspective that situates students’ writing development across an expansive ecology of literate activities rather than within any single setting” (107). This model takes into account students’ writing practices in and out of school and the ways in which those practices intersect, which I think is a smart move in developing genuinely valid assessments of writing (based not on performance on a single test or writing sample, but instead on a larger ecological framework in which instances of writing intersect and inform one another).

In discussing what such an ecological model might look like, Wardle and Roozen put it this way:

What, then, might an ecological model of assessment look like and who would it involve? An ecological assessment extends beyond one group, acknowledges the multiple sites where writing takes place, continues over several years, engages multiple assessments simultaneously, involves multiple leaders using varied data collection methods, and maintains some level of coordination with shared results and data. (114)

Naturally, problems arise with this kind of assessment. This model would “require relationship-building among and across groups that often do not talk” (114). It would require time and communication among disparate groups. It would require patience on the part of those doing the assessments and an understanding that students’ experiences in our classes or programs are only one knot in a complicated network of knots that indicate their ability to write. This involves career-long portfolios, opportunities for selection, collection, and reflection, and a much larger picture of students as writers than any one semester, assignment, or test can account for. I’m not sure exactly how this would look in practice, but as a firm believer in both composition and life as processes, I am drawn to the ideas and implications presented here.

Validity and Reliability

I’d like to point back to Kathleen Blake Yancey’s “Looking Back as We Look Forward” from our histories unit. I found this essay especially helpful, and I’m sure I’ll refer back to it throughout this course. There, Yancey discusses validity and reliability (two terms that I’ll openly admit to having a hard time wrapping my mind around) as a pendulum swinging and assessment as “an exercise in balancing the twin concepts of validity and reliability” (135). For Yancey, “Validity means measuring what you intend to measure, reliability that you can measure it consistently” (135). This sounds simple enough, but I think it also raises really difficult questions about just what it is we’re trying to measure with tests and assessments, and how we can achieve any sort of reliability on something as seemingly subjective as writing assessment.

Camp, in her essay “Changing the Model for the Direct Assessment of Writing,” also identifies a tension between reliability and validity, pointing to psychometrics and multiple-choice writing tests as examples of high reliability. But she also calls into question whether these tests are, in fact, valid. I think it comes back to Yancey’s point: if validity means measuring what you intend to measure, we must ask ourselves, just what is it we’re trying to measure with multiple-choice writing tests? Camp points to Cooper and others, who argue that “multiple-choice tests do not sample the full range of knowledge and skills involved in writing” (105).

I do appreciate Moss’s assertion, based on Haswell, that writing assessment should be open to public review, which “not only enhances the program that is the focus of the evaluation effort, but more importantly enhances the practice of evaluation itself” (158). This gets me to my own feeling about these issues: in my own experience, I don’t see how one can create an assessment measure that is not entirely contextual. Is this valid? I think so, because it examines what is being tested and adjusts the assessment to an individual context. Is it reliable? Now, there I’m not so sure…

Varied and Contested Histories

I am struck by Huot and Neal’s honest assertion that by focusing on technology as a lens through which to view the history of assessment, “Our representation of writing assessment will be limited” (417). That helps me to realize the importance of developing my own frame when discussing writing assessment in this course and beyond. I have found things in each of these histories that intrigue me, but I also realize that in order to understand even a fraction of the current moment in writing assessment, I’ve got to figure out where I fit into it, where it intersects with my own pedagogical and research interests.

The most arresting aspect of these readings for me is Huot and Neal’s assertion in their techno-history that assessments are never neutral, and that technology is ideological. David makes an excellent point in his blog post about assessment models being responses to exigences, and reading through these histories, I can’t help but be surprised by the historical moments that spurred changes in the field. I guess I’m wondering: if not something as traumatic as war, what is our exigence today? Is it arguing for humanity in an age of machine scoring? Maybe.

I’m also intrigued by Elliot’s “lone wolves” and the hero mentality in assessment; I know I’ve got a touch of that myself. Behizadeh and Engelhard make interesting points about the need for greater communication between theorists and practitioners. This is also echoed in Yancey’s discussion of portfolio assessment and the need for assessment to be linked to pedagogy. These things all bring up the question of how we can link pedagogy, outcomes, goals, and everything else that goes with teaching (or admissions or placement, whatever the case may be) to assessment. Is there a better test or assignment that we can design that won’t feel like just a hoop for our students to jump through? And once we’ve designed it, how do we assess it?