Feature Vulnerability in Automated Essay Scoring — University of Amsterdam, 2026

Dataset at a Glance

Student essays

24,728

ASAP 2.0 corpus

Prompt types

Scored 1–6 by raters

AI variants

GPT-4o-mini, temp=0

AES features

surface · readability · coherence · syntactic

Feature Vulnerability in Action

As AI intervention intensifies, fragile features shift markedly while robust features stay comparatively stable. This is the core thesis of this work.

Legend: fragile · neutral · robust

Research Questions

Main RQ

To what extent can automated essay scoring features distinguish AI-assisted from human-written text, and which features remain robust for quality assessment under AI assistance?

SRQ 1

How well can binary classifiers distinguish original student essays from AI-assisted variants across three intervention levels (grammar correction, style enhancement, substantive revision)?

SRQ 2

Which AES features carry the highest importance for detecting AI assistance, and which carry the lowest?

SRQ 3

Can quality assessment models using only robust features show less performance degradation than all-feature models when applied to AI-assisted essays?
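The comparison in SRQ 3 can be written down as a small calculation. This is a minimal sketch, assuming a higher-is-better scoring metric evaluated once on original essays and once on AI-assisted essays; the metric itself is not named here, and the function names `degradation` and `robust_degrades_less` are hypothetical.

```python
def degradation(score_original, score_assisted):
    """Absolute drop in a higher-is-better metric on AI-assisted essays."""
    return score_original - score_assisted


def robust_degrades_less(robust_scores, all_feature_scores):
    """Compare two models' drops.

    Each argument is a (score_on_original, score_on_ai_assisted) pair.
    Returns True when the robust-feature model loses less performance
    than the all-feature model.
    """
    return degradation(*robust_scores) < degradation(*all_feature_scores)
```

Under this framing, a robust-feature model may score lower on original essays and still answer SRQ 3 positively, as long as its drop is smaller.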

Feature Families

Surface

Avg. sentence length
Word count
Sentence count
Avg. word length
MATTR (moving-average TTR)
MTLD
HDD
POS noun ratio
POS verb ratio
POS adjective ratio
POS adverb ratio
POS other ratio
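MATTR, listed above, averages the type-token ratio over a sliding window, which makes it less sensitive to essay length than a plain type-token ratio. The following is a minimal sketch, assuming pre-tokenized, lowercased input; the window size of 50 and the function name `mattr` are assumptions, not settings taken from this work.

```python
def mattr(tokens, window=50):
    """Moving-average type-token ratio over a sliding window of tokens."""
    if not tokens:
        return 0.0
    # Texts shorter than the window fall back to a plain type-token ratio.
    if len(tokens) <= window:
        return len(set(tokens)) / len(tokens)
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)
```

Usage would look like `mattr(essay_text.lower().split())`, where `essay_text` is a hypothetical string holding one essay.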

Readability

Flesch-Kincaid Grade
Coleman-Liau Index
Gunning Fog
SMOG Index
Automated Readability Index
Dale-Chall Readability Score
Linsear Write Formula
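Of the indices above, the Flesch-Kincaid Grade has a simple closed form: 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sketch below computes it from raw counts. The syllable counter is a rough vowel-group heuristic added for illustration only; it is an assumption and not the counting method used in this work.

```python
import re


def count_syllables(word):
    """Rough heuristic: one syllable per run of vowels, at least one per word."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_kincaid_grade(n_words, n_sentences, n_syllables):
    """Flesch-Kincaid Grade Level from word, sentence, and syllable counts."""
    if n_words == 0 or n_sentences == 0:
        return 0.0
    return (
        0.39 * (n_words / n_sentences)
        + 11.8 * (n_syllables / n_words)
        - 15.59
    )
```

Longer sentences and longer words both push the grade up, which is one reason a rewrite that changes sentence structure can move this family of features.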

Coherence

Connective frequency (per sentence)
Avg. lexical overlap (adjacent sentences)
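Lexical overlap between adjacent sentences can be measured in several ways. The sketch below assumes Jaccard overlap of lowercase word sets, averaged over all adjacent sentence pairs; the exact overlap definition is not stated here, so both the measure and the helper names are assumptions.

```python
import re


def word_set(sentence):
    """Lowercase word types in a sentence (simple regex tokenizer)."""
    return set(re.findall(r"[a-z']+", sentence.lower()))


def avg_adjacent_overlap(sentences):
    """Mean Jaccard overlap between each sentence and the one that follows it."""
    pairs = list(zip(sentences, sentences[1:]))
    if not pairs:
        return 0.0
    scores = []
    for first, second in pairs:
        a, b = word_set(first), word_set(second)
        union = a | b
        scores.append(len(a & b) / len(union) if union else 0.0)
    return sum(scores) / len(scores)
```

A stricter variant might restrict the sets to content words or lemmas; this sketch counts every word type.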

Syntactic

Mean dependency tree depth
Subordinate clause ratio
Passive ratio
Mean noun phrase modifiers
Pronoun density
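Mean dependency tree depth can be illustrated without committing to a particular parser. The sketch below assumes each sentence has already been parsed into a list of head indices, with -1 marking the root, and takes a sentence's depth to be the longest path from any token up to the root. The input format, that depth definition, and the function names are assumptions for illustration.

```python
def tree_depth(heads):
    """Depth of one dependency tree.

    heads[i] is the index of token i's head; the root has head -1.
    Assumes the input is a valid tree (no cycles).
    """
    def depth(i):
        steps = 0
        while heads[i] != -1:
            i = heads[i]
            steps += 1
        return steps

    return max((depth(i) for i in range(len(heads))), default=0)


def mean_tree_depth(parsed_sentences):
    """Average tree depth over all sentences of an essay."""
    if not parsed_sentences:
        return 0.0
    return sum(tree_depth(h) for h in parsed_sentences) / len(parsed_sentences)
```

For "The cat sat", with "The" attached to "cat" and "cat" attached to the root "sat", the head list is `[1, 2, -1]` and the depth is 2.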