ASR / TTS Reference Sets
Your model scored well on the Pashto benchmark. The benchmark was one dialect, in studio conditions — and your users are neither.
The AI Data Factory's speech-data function: recognition and synthesis reference corpora for the under-resourced Afghan languages — representation designed before a microphone opens, transcription held to reference grade, evaluation splits built for trust, and every voice consented, compensated, and protected.
The reference is the instrument. For these languages, the instrument barely exists.
ASR / TTS Reference Sets is the AI Data Factory's speech-data function: recognition and synthesis reference corpora for the under-resourced Afghan languages — Pashto, Dari, and twenty-two more — built as instruments. Composition is designed before recording (dialect bands, registers, speaker balance, acoustic conditions), transcription is held to reference grade with glossary-governed orthography, evaluation splits are versioned and contamination-conscious, and every voice is consented in the speaker's own language with synthesis scope explicit and separate.
Speech AI lives and dies by its reference data twice over. In training, the corpus sets what the model can hear at all. In evaluation, the reference transcript is the ruler every error rate is measured against — and a wrong or inconsistent reference makes the score unreliable: a strong model graded down by a sloppy transcript, a weak one passed by an easy register.
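To make the "ruler" point concrete, here is a minimal sketch of word error rate (WER) computed as word-level edit distance divided by reference length. The function name and the placeholder tokens are illustrative assumptions, not part of any Factory tooling. It shows a model that transcribes the speech perfectly still being charged an error when the reference transcript itself contains a mistake.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens stand in for real words (illustrative only).
spoken = "a b c d"            # what the speaker actually said
model_output = "a b c d"      # the model heard it correctly
sloppy_reference = "a b x d"  # the transcriber got one word wrong

true_wer = word_error_rate(spoken, model_output)
reported_wer = word_error_rate(sloppy_reference, model_output)
print(true_wer, reported_wer)
```

The model made no mistakes, yet the reported score charges it one error in four words. The same mechanism can run the other way when the reference and the model share a mistake.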
Recorded hours have begun to appear. A community-built open corpus has reached meaningful scale for Pashto — but it is recent, almost entirely prompted read speech, and skewed toward diaspora and Northern dialects. Open Dari data remains thin, usually folded into the better-resourced Persian. For most of the other twenty-two Afghan languages, there is effectively nothing.
A recorded corpus is not automatically a representative instrument. What exists carries a quiet skew (broadcast and read speech, formal registers, prestige dialect bands, clean conditions) because that is what was easy to record. A model trained and graded on that material performs exactly where it was sampled and fails exactly where it was not: the telephone line, the conversational register, the band the corpus never visited. The failure is then reported as a respectable average, which is how an unusable system passes review.
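The "respectable average" effect can be sketched in a few lines. The error and word counts below are invented for illustration and do not come from any real evaluation. The stratum labels are assumptions as well. The sketch compares a pooled error rate with the per-stratum rates behind it.

```python
from collections import defaultdict

# (register, condition, word_errors, reference_words) -- invented counts
results = [
    ("read", "studio", 40, 1000),
    ("read", "studio", 50, 1000),
    ("conversational", "telephone", 90, 200),
]


def pooled_wer(rows):
    """Error rate over all words, ignoring which stratum they came from."""
    errors = sum(e for _, _, e, _ in rows)
    words = sum(n for _, _, _, n in rows)
    return errors / words


def wer_by_stratum(rows):
    """Error rate per (register, condition) cell."""
    totals = defaultdict(lambda: [0, 0])
    for register, condition, errors, words in rows:
        totals[(register, condition)][0] += errors
        totals[(register, condition)][1] += words
    return {cell: e / n for cell, (e, n) in totals.items()}


overall = pooled_wer(results)
per_stratum = wer_by_stratum(results)
print(f"pooled: {overall:.3f}")
for cell, rate in per_stratum.items():
    print(cell, f"{rate:.3f}")
```

With these made-up numbers the pooled figure sits near 8 percent while the conversational telephone stratum is at 45 percent. The well-sampled stratum contributes most of the words and so dominates the average.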
So the Factory treats representation as a design problem, not a collection accident. Coverage targets are set before recording begins — dialect bands, registers, speaker balance, acoustic conditions — and the transcription that follows is held to reference grade, because everything downstream is measured against it.
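One way to picture coverage targets set before recording is a grid of target hours per cell, checked against what has been recorded so far. This is a minimal sketch under stated assumptions: the band names, registers, and hour figures are hypothetical placeholders, and the Factory's actual planning format is not described in this page.

```python
# Hypothetical targets in hours per (dialect band, register) cell,
# fixed before any recording session is scheduled.
targets = {
    ("band_A", "read"): 10.0,
    ("band_A", "conversational"): 10.0,
    ("band_B", "read"): 10.0,
    ("band_B", "conversational"): 10.0,
}

# Hypothetical hours recorded so far; a missing cell means zero hours.
recorded = {
    ("band_A", "read"): 14.0,
    ("band_A", "conversational"): 3.5,
    ("band_B", "read"): 10.0,
}


def coverage_gaps(targets, recorded):
    """Return the shortfall in hours for every cell still under target."""
    gaps = {}
    for cell, target_hours in targets.items():
        shortfall = target_hours - recorded.get(cell, 0.0)
        if shortfall > 0:
            gaps[cell] = shortfall
    return gaps


gaps = coverage_gaps(targets, recorded)
for cell, hours in gaps.items():
    print(cell, f"short by {hours:.1f} h")
```

Surplus hours in an easy cell do not offset a shortfall elsewhere, which is the difference between a designed grid and a total hour count.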
Figure: where existing corpora sample (shown in gold), set against the registers and dialect bands speakers actually produce. A representative instrument is the whole grid: designed, not assembled.
Recorded is not represented.
A few hours of one dialect in one register is a recording. Representation is designed — bands, registers, speakers, and conditions, balanced on purpose.
Corpora designed to sound like the population, not the studio.
Voices recorded to specification, with the sound system covered.
A voice identifies a person. The data is governed accordingly.
Speech data is personal data — and for speakers from communities with genuine security considerations, it is sensitive personal data. The perimeter:
Corpora that arrive with their composition on paper.
Every corpus ships as an instrument, documented like one. The composition is on paper before you commit — which bands, which registers, which speaker balance, which conditions — and the delivered set matches the design or says exactly where it could not. Transcription quality is itself measured and reported, because a reference with an unknown error rate is not a reference. Evaluation splits arrive versioned, with their sourcing documented, ready to stand behind a number your team will be asked to defend. And the rights file travels with the data: consent records and scope, clean enough for procurement, responsible-AI review, and the synthesis question someone will eventually ask.
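A contamination check on an evaluation split can be sketched as follows. It assumes that contamination here means a speaker or a prompt text appearing in both training and evaluation data. That reading is an assumption, as are the field names `speaker_id` and `text` and the placeholder records.

```python
def normalize(text: str) -> str:
    """Collapse whitespace so trivially different copies compare equal."""
    return " ".join(text.split())


def contamination_report(train, evaluation):
    """List speakers and prompt texts shared between the two splits."""
    train_speakers = {u["speaker_id"] for u in train}
    train_texts = {normalize(u["text"]) for u in train}
    eval_speakers = {u["speaker_id"] for u in evaluation}
    eval_texts = {normalize(u["text"]) for u in evaluation}
    return {
        "shared_speakers": sorted(train_speakers & eval_speakers),
        "shared_texts": sorted(train_texts & eval_texts),
    }


# Placeholder utterance records (hypothetical schema and values).
train = [
    {"speaker_id": "spk01", "text": "prompt one"},
    {"speaker_id": "spk02", "text": "prompt two"},
]
evaluation = [
    {"speaker_id": "spk03", "text": "prompt  two"},  # same prompt, extra space
    {"speaker_id": "spk01", "text": "prompt three"},  # same speaker as train
]

report = contamination_report(train, evaluation)
print(report)
```

An empty report is a precondition for trusting a score and does not prove the split is clean. Overlap can also come from shared source material that exact text matching will not catch.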
Every hour in the corpus began as a person who agreed to be recorded — recruited transparently, consented in their own language, and compensated as a contributor. The instrument is built on that, or it is not trusted.
Explore the AI Data Factory.
Measure against a reference you can trust.
For speech AI teams whose Afghan-language coverage deserves an instrument, not an accident. Briefings are conducted under NDA, in Washington, D.C. or virtually.
Request a confidential briefing
Senior-led and scoped from the first conversation.