Built a normalizer so WER stops penalizing formatting differences in STT evals! [P]

Hey guys! At my company, we've been benchmarking STT engines a lot and kept running into the same issue: WER penalizes formatting differences that have nothing to do with actual recognition quality. "It's $50" vs "it is fifty dollars", "3:00PM" vs "3 pm". Both are perfect transcriptions, but the error rate comes out terrible.

The fix is normalizing both sides before scoring, but on every project we had a different script doing it slightly differently. So we built a proper library and open-sourced it.
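To make the effect concrete, here's a minimal sketch in plain Python. It is not the library's code: `wer`, `toy_normalize` and `TOY_RULES` are hypothetical names made up for this illustration, and the rule table only covers the example sentence. It just shows how the same pair scores before and after both sides are normalized.

```python
# Hypothetical, hand-written rules that only cover this one example.
TOY_RULES = {"it's": "it is", "$50": "50 dollars", "fifty": "50"}


def toy_normalize(text: str) -> str:
    """Lowercase, then rewrite each whitespace-separated token via TOY_RULES."""
    tokens = text.lower().split()
    return " ".join(TOY_RULES.get(token, token) for token in tokens)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


reference = "it is fifty dollars"
hypothesis = "It's $50"

raw_score = wer(reference, hypothesis)
normalized_score = wer(toy_normalize(reference), toy_normalize(hypothesis))
print(raw_score, normalized_score)
```

Without normalization no word matches, so the pair scores a WER of 1.0; once both sides go through the same rules they become identical and the WER drops to 0.0.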

The result is gladia-normalization: you run your transcripts through a configurable normalization pipeline before you compute WER.

    from normalization import load_pipeline

    pipeline = load_pipeline("gladia-3", language="en")
    pipeline.normalize("It's $50 at 3:00PM")
    # => "it is 50 dollars at 3 pm"

Pipelines are YAML-defined so you know exactly what's running and in what order. Deterministic, version-controllable, customizable.
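Why the order matters is easy to show with a toy example. The sketch below is not the library's YAML schema or API: the step names, the `STEPS` registry and `run_pipeline` are all hypothetical. A plain Python list stands in for the ordered step list a config file would hold.

```python
import string


def lowercase(text: str) -> str:
    return text.lower()


def expand_contractions(text: str) -> str:
    # Toy rule: only handles "it's".
    return text.replace("it's", "it is")


def strip_punctuation(text: str) -> str:
    # Removes every ASCII punctuation character, apostrophes included.
    return text.translate(str.maketrans("", "", string.punctuation))


# Hypothetical registry mapping step names to functions.
STEPS = {
    "lowercase": lowercase,
    "expand_contractions": expand_contractions,
    "strip_punctuation": strip_punctuation,
}


def run_pipeline(step_names: list[str], text: str) -> str:
    """Apply the named steps strictly in the order given."""
    for name in step_names:
        text = STEPS[name](text)
    return text


contractions_first = ["lowercase", "expand_contractions", "strip_punctuation"]
punctuation_first = ["lowercase", "strip_punctuation", "expand_contractions"]

print(run_pipeline(contractions_first, "It's fine."))  # it is fine
print(run_pipeline(punctuation_first, "It's fine."))   # its fine
```

Same steps, different order, different output: stripping punctuation first removes the apostrophe, so the contraction rule never fires. Having the order written down in one file is what makes that kind of thing visible and reviewable.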

Currently supports English, French, German, Italian, Spanish and Dutch - though we know our non-English presets need refinement and we're actively looking for native speakers to contribute and help get the behavior right for each language 🙌!

MIT licensed, repo here → https://github.com/gladiaio/normalization

Curious how others are handling this. Drop a comment if you've been dealing with the same thing :)

submitted by /u/Karamouche