Built a normalizer so WER stops penalizing formatting differences in STT evals! [P]
Hey guys! At my company, we've been benchmarking STT engines a lot and kept running into the same issue: WER penalizes formatting differences that have nothing to do with actual recognition quality. "It's $50" vs "it is fifty dollars", "3:00PM" vs "3 pm". Both are perfect transcriptions, but the error rate comes out terrible.
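To make the problem concrete, here's a minimal WER sketch: plain word-level edit distance over whitespace-split tokens, divided by the reference length. The `wer` function is written just for this post and isn't taken from any particular toolkit, so treat it as an illustration rather than a reference implementation.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein, row by row).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Same spoken content, different formatting, no normalization applied.
print(wer("it is fifty dollars", "It's $50"))  # 1.0
print(wer("3 pm", "3:00PM"))                   # 1.0
```

Both pairs score 100% WER even though nothing was misrecognized, because no token matches exactly.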
The fix is normalizing both sides before scoring, but on every project we had a different script doing it slightly differently. So we built a proper library and open-sourced it.
It's called gladia-normalization, and it lets you run your transcripts through a configurable normalization pipeline before you compute WER:
    from normalization import load_pipeline

    pipeline = load_pipeline("gladia-3", language="en")
    pipeline.normalize("It's $50 at 3:00PM")
    # => "it is 50 dollars at 3 pm"

Pipelines are YAML-defined so you know exactly what's running and in what order. Deterministic, version-controllable, customizable.
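If it helps to picture why the order of steps matters, here's a toy ordered pipeline in plain Python. To be clear, this is not the library's code, YAML schema, or step names: `PIPELINE`, `toy_normalize`, and every step below are hypothetical stand-ins, and the rules only cover the example sentence from this post.

```python
import re

# Tiny lookup table, just enough for the example sentence.
NUMBER_WORDS = {"fifty": "50"}


def lowercase(text: str) -> str:
    return text.lower()


def expand_contractions(text: str) -> str:
    # Expects lowercased input, so it has to run after `lowercase`.
    return re.sub(r"\bit's\b", "it is", text)


def spell_out_currency(text: str) -> str:
    return re.sub(r"\$(\d+)", r"\1 dollars", text)


def simplify_times(text: str) -> str:
    # "3:00pm" -> "3 pm"; also expects lowercased input.
    return re.sub(r"\b(\d{1,2}):00\s*(am|pm)\b", r"\1 \2", text)


def number_words_to_digits(text: str) -> str:
    # Splitting and re-joining also collapses repeated whitespace.
    return " ".join(NUMBER_WORDS.get(word, word) for word in text.split())


# Hypothetical ordered pipeline: each step runs on the output of the previous one.
PIPELINE = [
    lowercase,
    expand_contractions,
    spell_out_currency,
    simplify_times,
    number_words_to_digits,
]


def toy_normalize(text: str) -> str:
    for step in PIPELINE:
        text = step(text)
    return text


hypothesis = toy_normalize("It's $50 at 3:00PM")
reference = toy_normalize("it is fifty dollars at 3 pm")
print(hypothesis)               # it is 50 dollars at 3 pm
print(hypothesis == reference)  # True
```

Once both sides go through the same steps in the same order, the two strings are identical, so any WER computed on them is 0. Moving `lowercase` to the end would break the contraction and time rules, which is why having the order written down explicitly is useful.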
Currently supports English, French, German, Italian, Spanish and Dutch - though we know our non-English presets need refinement and we're actively looking for native speakers to contribute and help get the behavior right for each language 🙌!
MIT licensed, repo here → https://github.com/gladiaio/normalization
Curious how others are handling this. Drop a comment if you've been dealing with the same thing :)