Feature Vulnerability in Automated Essay Scoring — University of Amsterdam, 2026

Dataset at a Glance

Student essays

24,728

ASAP 2.0 corpus

Prompt types

Scored 1–6 by raters

AI variants

GPT-4o-mini, temp=0

AES features

surface · readability · coherence · syntactic

Feature Vulnerability in Action

The core thesis of this work: as AI interventions intensify, fragile features shift dramatically while robust features remain stable.

fragile · intermediate · robust
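One way to make "shift" concrete is a standardized mean difference between a feature's values on original essays and on their AI-assisted variants. The sketch below uses Cohen's d with a pooled standard deviation. This is an assumption chosen for illustration, not necessarily the measure used in this work, and the function name `standardized_shift` is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def standardized_shift(original, assisted):
    """Cohen's d for one feature: (mean(assisted) - mean(original)) / pooled SD.

    Assumption: both inputs are lists of that feature's values, with at
    least two values each. The choice of Cohen's d is illustrative only.
    """
    n1, n2 = len(original), len(assisted)
    s1, s2 = stdev(original), stdev(assisted)
    pooled = sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))
    if pooled == 0:
        raise ValueError("feature has zero variance in both samples")
    return (mean(assisted) - mean(original)) / pooled
```

Under this reading, a fragile feature would show a large absolute value that grows with intervention intensity, while a robust feature would stay near zero. Any cut-off between the three groups would have to come from the study itself.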

Research Questions

Main RQ

To what extent can AES features distinguish AI-assisted from human-written text, and which features remain robust for quality assessment under AI assistance?

SRQ 1

How accurately can classifiers detect AI assistance using AES features, and does detection accuracy differ across intervention types?

SRQ 2

Which AES features are most important for detecting AI assistance, and which are least important?

SRQ 3

Can quality assessment models using only robust features maintain performance on original essays while showing less degradation than all-feature models when applied to AI-assisted essays?

Feature Families

Surface

Avg. sentence length
Word count
Sentence count
Avg. word length
MATTR (moving-average TTR)
MTLD
HDD
POS noun ratio
POS verb ratio
POS adjective ratio
POS adverb ratio
POS other ratio
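As an illustration of the surface family, MATTR averages the type-token ratio over every sliding window of fixed length, which makes it less sensitive to essay length than a plain TTR. A minimal sketch follows, assuming a simple regex tokenizer and a window of 50 tokens. Both are assumptions, not settings reported here, and `tokenize` and `mattr` are hypothetical helper names.

```python
import re


def tokenize(text):
    """Lowercase word tokenizer (assumption: letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def mattr(tokens, window=50):
    """Moving-average type-token ratio: mean TTR over all sliding windows."""
    if not tokens:
        return 0.0
    # Texts shorter than the window fall back to a plain type-token ratio.
    if len(tokens) <= window:
        return len(set(tokens)) / len(tokens)
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)
```

Word count and average word length fall out of the same token list, as `len(tokens)` and `sum(map(len, tokens)) / len(tokens)`.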

Readability

Flesch-Kincaid Grade
Coleman-Liau Index
Gunning Fog
SMOG Index
Automated Readability Index
Dale-Chall Readability Score
Linsear Write Formula
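The readability indices above are closed-form functions of word, sentence, and syllable (or character) counts. As one example, the Flesch-Kincaid Grade is 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59. The sketch below implements it with a rough vowel-group syllable heuristic and a punctuation-based sentence splitter. Both are simplifying assumptions, and dedicated readability libraries use more careful counting.

```python
import re


def count_syllables(word):
    """Rough heuristic: one syllable per group of consecutive vowels (min. 1)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_kincaid_grade(text):
    """Flesch-Kincaid Grade Level from simple regex-based counts."""
    # Assumption: sentences end at '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)
```

Because every index in this family depends on the same few counts, the scores tend to move together when a rewrite changes sentence or word length.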

Coherence

Connective frequency (per sentence)
Avg. lexical overlap (adjacent sentences)
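Both coherence features can be sketched with the standard library alone. The version below assumes Jaccard overlap between the word sets of adjacent sentences and a small, purely illustrative connective list. The actual overlap definition and connective inventory are not stated here, so `CONNECTIVES` and the function names are hypothetical.

```python
import re

# Illustrative list only; a real connective inventory would be much larger.
CONNECTIVES = {"however", "therefore", "moreover", "because", "although", "furthermore"}


def split_sentences(text):
    """Assumption: sentences end at '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def words(sentence):
    return re.findall(r"[a-z']+", sentence.lower())


def avg_adjacent_overlap(text):
    """Mean Jaccard overlap between word sets of consecutive sentences."""
    sets = [set(words(s)) for s in split_sentences(text)]
    pairs = list(zip(sets, sets[1:]))
    if not pairs:
        return 0.0
    scores = [len(a & b) / len(a | b) if a | b else 0.0 for a, b in pairs]
    return sum(scores) / len(scores)


def connectives_per_sentence(text):
    """Connective tokens divided by the number of sentences."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    hits = sum(1 for s in sentences for w in words(s) if w in CONNECTIVES)
    return hits / len(sentences)
```

For the text "Dogs bark loudly. However, dogs sleep." the two sentences share one word out of five distinct words, so the overlap is 0.2, and one connective across two sentences gives 0.5.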

Syntactic

Mean dependency tree depth
Subordinate clause ratio
Passive ratio
Mean noun phrase modifiers
Pronoun density