Speech Fairness

Speech AI progress is not evenly distributed

Bigger speech models can look better every year while still leaving many English speakers with unstable, unreliable transcripts.

Alexander Metzger · June 19, 2026

Pre-print Draft · ASR Fairness · World Englishes

[Figure: ASR word error rate plot for ALLSTAR DHR accents]

The draft studies 16 speech recognition systems released from 2021 to 2025 across 8 accent-aware evaluations. The core finding is uncomfortable: the gap between mainstream and minority English varieties is stubborn, and minority accents can suffer large regressions between model generations.

The optimism we tested

There is a familiar assumption in machine learning: bigger models, more data, and newer releases will eventually wash away old biases. For speech recognition, that would mean that today's accent gaps are temporary glitches. Wait a few generations and the rising tide lifts every speaker.

Connecting the Dots asks whether that story holds up over time. We evaluate systems on datasets that include Mainstream US English and Standard Southern British English alongside many minority English varieties. Instead of reporting a single average word error rate per model, we ask how performance changes for different groups of speakers across successive releases.
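To make the idea of a disaggregated evaluation concrete, here is a minimal sketch of computing word error rate separately for each speaker group rather than as one pooled average. It is not the draft's evaluation code: the function names, the lowercase-and-split text normalization, and the toy transcripts are all assumptions made for illustration.

```python
from collections import defaultdict


def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Levenshtein distance over words, computed one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1], len(ref)


def wer_by_group(samples):
    """samples: iterable of (group, reference, hypothesis) tuples.

    Returns {group: WER}, pooling errors and reference words within each group.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for group, ref, hyp in samples:
        e, n = word_errors(ref, hyp)
        errors[group] += e
        words[group] += n
    return {g: errors[g] / words[g] for g in words if words[g] > 0}


# Toy, made-up transcripts; the group labels are hypothetical placeholders.
toy_samples = [
    ("variety_a", "turn the lights off", "turn the lights off"),
    ("variety_b", "turn the lights off", "turn the light of"),
]
toy_wer = wer_by_group(toy_samples)
print(toy_wer)
```

Running the same function on every release, with the same test utterances, gives one error rate per group per model, which is the kind of table a question about change over time needs.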

What we found

Mainstream US English is consistently protected by standard benchmarking practice. It often stays stable and can approach very low error rates. Minority accents, by contrast, can improve in one generation and degrade sharply in the next. In some cases, the draft finds degradations as large as 65 percentage points in absolute word error rate between systems.
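For readers less familiar with the metric, the standard definition of word error rate and the meaning of an absolute change in percentage points can be written as follows. The symbols and the "old" and "new" labels are notation introduced here for illustration; the draft may normalize or score transcripts differently.

```latex
\mathrm{WER} = \frac{S + D + I}{N},
\qquad
\Delta_{\mathrm{abs}} = 100 \times \left( \mathrm{WER}_{\mathrm{new}} - \mathrm{WER}_{\mathrm{old}} \right) \ \text{percentage points}
```

Here S, D, and I count substituted, deleted, and inserted words, and N is the number of words in the reference transcript. An absolute change is a difference between two rates, not a ratio, so a 65-point degradation means the newer system's error rate on that data is 65 points higher than the older system's, whatever the starting rate was.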

That means "state of the art" is not a single global fact. A model can be newer and better on a leaderboard while becoming worse for a speaker whose accent was not represented in the evaluation culture surrounding that model.

Why it matters

Speech recognition is now used in schools, workplaces, immigration systems, hiring workflows, call centers, and accessibility tools. In those settings, bad transcripts are not merely annoying. They can change who gets understood, who gets help, and who gets treated as competent.

The fix is not one magic model. It is continuous, disaggregated auditing: developers and deployers need to check which speakers are helped, which speakers are harmed, and whether a new release quietly breaks performance for communities outside the benchmark center.
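One way such an audit could be automated is to compare per-group error rates between consecutive releases and flag any group whose error rate rises by more than a chosen tolerance. The sketch below is a hypothetical illustration, not a procedure described in the draft: the function name, the 5-point threshold, the model names, and all numbers are made up.

```python
def flag_regressions(wer_by_release, threshold_pp=5.0):
    """Compare consecutive releases and report per-group WER regressions.

    wer_by_release: list of (release_name, {group: WER as a fraction}),
    in release order. threshold_pp is a hypothetical tolerance in absolute
    percentage points, not a value taken from the draft.
    """
    flags = []
    pairs = zip(wer_by_release, wer_by_release[1:])
    for (old_name, old), (new_name, new) in pairs:
        # Only compare groups that were measured in both releases.
        for group in sorted(old.keys() & new.keys()):
            delta_pp = (new[group] - old[group]) * 100
            if delta_pp > threshold_pp:
                flags.append((group, old_name, new_name, round(delta_pp, 1)))
    return flags


# Made-up numbers for illustration only.
history = [
    ("model_v1", {"variety_a": 0.08, "variety_b": 0.30}),
    ("model_v2", {"variety_a": 0.05, "variety_b": 0.22}),
    ("model_v3", {"variety_a": 0.04, "variety_b": 0.41}),
]
regressions = flag_regressions(history)
print(regressions)
```

In this toy history, the first group improves with every release while the second improves once and then regresses, so only the second group is flagged at the last step. A check like this could run before deployment, with the threshold and group definitions chosen together with the affected communities rather than fixed in code.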