ASR / TTS Reference Sets

Your model scored well on the Pashto benchmark. The benchmark was one dialect, in studio conditions — and your users are neither.

The AI Data Factory's speech-data function: recognition and synthesis reference corpora for the under-resourced Afghan languages — representation designed before a microphone opens, transcription held to reference grade, evaluation splits built for trust, and every voice consented, compensated, and protected.

Why reference sets

The reference is the instrument. For these languages, the instrument barely exists.

Definition

ASR / TTS Reference Sets is the AI Data Factory's speech-data function: recognition and synthesis reference corpora for the under-resourced Afghan languages — Pashto, Dari, and twenty-two more — built as instruments. Composition is designed before recording (dialect bands, registers, speaker balance, acoustic conditions), transcription is held to reference grade with glossary-governed orthography, evaluation splits are versioned and contamination-conscious, and every voice is consented in the speaker's own language with synthesis scope explicit and separate.

Speech AI lives and dies by its reference data twice over. In training, the corpus sets what the model can hear at all. In evaluation, the reference transcript is the ruler every error rate is measured against — and a wrong or inconsistent reference makes the score unreliable: a strong model graded down by a sloppy transcript, a weak one passed by an easy register.
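To make the ruler concrete: word error rate (WER) is conventionally the word-level edit distance between hypothesis and reference, divided by the number of reference words. The sketch below is a minimal illustration, not the Factory's scoring tool; the function names and the placeholder tokens (w1, w2, ...) are assumptions. It scores one unchanged model output against a clean reference and against a reference containing a single transcription slip.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, deletions, insertions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical placeholder tokens; the model output never changes.
hypothesis = "w1 w2 w3 w4"
clean_reference = "w1 w2 w3 w4"
sloppy_reference = "w1 w2 wX w4"  # one slip in the reference transcript

score_clean = wer(clean_reference, hypothesis)
score_sloppy = wer(sloppy_reference, hypothesis)
```

The model is identical in both cases; only the reference moved, and the score moved with it.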

Recorded hours have begun to appear. A community-built open corpus has reached meaningful scale for Pashto — but it is recent, almost entirely prompted read speech, and skewed toward diaspora and Northern dialects. Open Dari data remains thin, usually folded into the better-resourced Persian. For most of the other twenty-two Afghan languages, there is effectively nothing.

Recorded audio is not yet a representative instrument. What exists carries a quiet skew (broadcast and read speech, formal registers, prestige dialect bands, clean conditions) because that is what was easy to record. A model trained and graded on that material performs exactly where it was sampled and fails exactly where it was not: the telephone line, the conversational register, the band the corpus never visited. The failure is then reported as a respectable average, which is how an unusable system passes review.
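A small sketch of how an average can hide that failure, assuming error counts are already available per dialect band and register. The band names, registers, and counts are hypothetical placeholders, not measured results; the point is only the arithmetic of pooling.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (dialect_band, register, word_errors, reference_words) for one scored batch.
Row = Tuple[str, str, int, int]


def pooled_wer(rows: List[Row]) -> float:
    """Single headline number: all errors over all reference words."""
    total_errors = sum(row[2] for row in rows)
    total_words = sum(row[3] for row in rows)
    if total_words == 0:
        raise ValueError("no reference words to score against")
    return total_errors / total_words


def wer_by_stratum(rows: List[Row]) -> Dict[Tuple[str, str], float]:
    """The same counts, reported separately per (band, register) cell."""
    errors: Dict[Tuple[str, str], int] = defaultdict(int)
    words: Dict[Tuple[str, str], int] = defaultdict(int)
    for band, register, err, n_words in rows:
        errors[(band, register)] += err
        words[(band, register)] += n_words
    return {cell: errors[cell] / words[cell] for cell in words if words[cell] > 0}


# Hypothetical counts: a well-sampled cell and a barely sampled one.
rows: List[Row] = [
    ("band_a", "read", 90, 1800),
    ("band_b", "telephony", 120, 200),
]

headline = pooled_wer(rows)
per_cell = wer_by_stratum(rows)
```

With these placeholder counts the pooled figure sits near 10 percent while the telephony cell alone is at 60 percent, because the under-sampled cell contributes few words to the denominator.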

So the Factory treats representation as a design problem, not a collection accident. Coverage targets are set before recording begins — dialect bands, registers, speaker balance, acoustic conditions — and the transcription that follows is held to reference grade, because everything downstream is measured against it.

Exhibit 01 — The coverage gap

Where existing corpora sample (gold) against the registers and bands speakers actually produce. A representative instrument is the whole grid — designed, not assembled.

24 Afghan languages and their dialect bands in scope

Hours of reference speech produced: published only when approved

100% of audio dialect-band tagged, speaker-attributed, and condition-labeled

2× independent transcription passes on evaluation references

The doctrine

Recorded is not represented.

A few hours of one dialect in one register is a recording. Representation is designed — bands, registers, speakers, and conditions, balanced on purpose.

Recognition

Corpora designed to sound like the population, not the studio.

Designed collection

Corpus composition is specified before recording: dialect-band coverage targets, register mix — conversational, telephony, broadcast, read — speaker balance across gender, age, and region, and acoustic-condition diversity. Representation by design, never by what was easy to gather.
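One way to picture composition specified before recording is a table of target hours per cell, checked against what has actually been collected. This is a hedged sketch only: the cell keys, the hour figures, and the function name are illustrative assumptions, and a real design would also carry speaker balance and acoustic conditions.

```python
from typing import Dict, Iterable, Tuple

# A design cell: (dialect_band, register). Hypothetical simplification.
Cell = Tuple[str, str]


def coverage_gaps(targets: Dict[Cell, float],
                  recorded: Iterable[Tuple[Cell, float]]) -> Dict[Cell, float]:
    """Return the hours still missing for every cell below its target."""
    collected: Dict[Cell, float] = {}
    for cell, hours in recorded:
        collected[cell] = collected.get(cell, 0.0) + hours
    return {
        cell: target - collected.get(cell, 0.0)
        for cell, target in targets.items()
        if collected.get(cell, 0.0) < target
    }


# Hypothetical targets, set before any session is booked.
targets: Dict[Cell, float] = {
    ("band_a", "conversational"): 10.0,
    ("band_a", "telephony"): 5.0,
    ("band_b", "read"): 4.0,
}

# Hypothetical session log: (cell, hours recorded in that session).
sessions = [
    (("band_a", "conversational"), 6.0),
    (("band_a", "conversational"), 4.0),
    (("band_b", "read"), 1.5),
]

gaps = coverage_gaps(targets, sessions)
```

A cell that was never visited shows up with its full target as the gap, which is the case an assembled corpus tends to leave invisible.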

Diaspora-based recording

Collection runs through the Human Intelligence Collective's communities — speakers recruited transparently, consented in their own language, and compensated as contributors, under the perimeter below.

Reference-grade transcription

Verbatim transcripts with timestamps, speaker labels, and dialect-band tags; code-switching marked, never erased; and orthographic normalization governed through the firm's glossary function — because in these languages, spelling decisions are terminology decisions.

Evaluation splits built for trust

Held-out test sets with documented composition, versioned releases, and contamination-conscious sourcing — material recorded for the purpose, not scraped from the public web the models already trained on.
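Two of these properties can be sketched in a few lines: keeping every speaker entirely on one side of the split, and screening test transcripts against text already known to be public. The hashing scheme, the release label, and the whitespace-and-case normalization are assumptions for illustration; Arabic-script text would need a fuller, glossary-governed normalization than the one shown.

```python
import hashlib
from typing import Iterable, List


def assign_split(speaker_id: str, release: str = "v1", test_percent: int = 10) -> str:
    """Deterministically place a speaker, and so all of their audio, in one split."""
    digest = hashlib.sha256(f"{release}:{speaker_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable value in 0..99
    return "test" if bucket < test_percent else "train"


def _normalize(text: str) -> str:
    """Illustrative normalization only: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def find_overlap(test_transcripts: Iterable[str],
                 public_texts: Iterable[str]) -> List[str]:
    """Return test transcripts whose normalized text also appears in public text."""
    public = {_normalize(text) for text in public_texts}
    return [t for t in test_transcripts if _normalize(t) in public]


# Hypothetical placeholder data.
split_for_speaker = assign_split("spk_001")
flagged = find_overlap(["w1  w2", "w3 w4"], ["W1 w2"])
```

Because the assignment depends only on the release label and the speaker identifier, rerunning it reproduces the same split, and changing the release label produces a new, separately citable one.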

Synthesis

Voices recorded to specification, with the sound system covered.

Voice-talent corpora

Selected native voice talent per language and dialect band, recorded to studio specification — engaged as professionals, with consent whose scope explicitly and separately covers synthesis.

Script design

Phonetically balanced prompt sets designed per language — deliberate coverage of the sound system and its band variants, not whichever sentences were lying around.
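A common way to approach deliberate coverage is greedy selection: repeatedly pick the candidate sentence that adds the most not-yet-covered sound units. The sketch below assumes each candidate prompt has already been mapped to a set of units (phones, diphones, or band variants); the prompt identifiers and unit labels are hypothetical, and this is an illustration of the general technique rather than the Factory's script-design procedure.

```python
from typing import Dict, List, Set


def select_prompts(candidates: Dict[str, Set[str]], budget: int) -> List[str]:
    """Greedy coverage: take the prompt with the largest gain until none adds anything."""
    covered: Set[str] = set()
    chosen: List[str] = []
    remaining = dict(candidates)
    while remaining and len(chosen) < budget:
        # Iterating in sorted order makes ties resolve to the smallest prompt id.
        best = max(sorted(remaining), key=lambda pid: len(remaining[pid] - covered))
        gain = remaining[best] - covered
        if not gain:
            break  # every remaining prompt is redundant
        chosen.append(best)
        covered |= gain
        del remaining[best]
    return chosen


# Hypothetical prompts mapped to the sound units they contain.
candidates: Dict[str, Set[str]] = {
    "p1": {"a", "b", "c"},
    "p2": {"c", "d"},
    "p3": {"a", "b"},
    "p4": {"d", "e"},
}

script = select_prompts(candidates, budget=3)
```

Here the selection stops after two prompts even though the budget allows three, because the remaining candidates add no new units.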

Pronunciation lexicons

Grapheme-to-phoneme resources with dialect-band variants, built on the same governed lexical infrastructure as everything else the Factory ships.
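As a rough picture of a lexicon with dialect-band variants, the sketch below stores a default pronunciation per entry plus optional band-specific overrides. The structure, the entry names, and the phone labels are placeholders invented for illustration; they are not real Pashto or Dari data and not the Factory's schema.

```python
from typing import Dict, List

# word -> {"default": phones, "<band>": phones, ...}; hypothetical layout.
Lexicon = Dict[str, Dict[str, List[str]]]

LEXICON: Lexicon = {
    "word_1": {"default": ["p1", "p2", "p3"], "band_b": ["p1", "p4", "p3"]},
    "word_2": {"default": ["p5", "p2"]},
}


def pronounce(lexicon: Lexicon, word: str, band: str) -> List[str]:
    """Return the band variant if one is recorded, else the default entry.

    Out-of-vocabulary words raise KeyError instead of being guessed.
    """
    variants = lexicon[word]
    return variants.get(band, variants["default"])


band_b_form = pronounce(LEXICON, "word_1", "band_b")
fallback_form = pronounce(LEXICON, "word_1", "band_a")
```

Failing loudly on unknown words is a deliberate choice in this sketch: a silent guess would put an ungoverned pronunciation into the voice.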

The voices are people

A voice identifies a person. The data is governed accordingly.

Speech data is personal data — and for speakers from communities with genuine security considerations, it is sensitive personal data. The perimeter:

Informed consent, in the speaker's language. Consent is explained and confirmed in the language and register the speaker actually commands: comprehension, not a signature.

Explicit scope, with synthesis separate. Consent states what the voice will be used for; consent to synthesis (a voice that can be made to say new things) is explicit, separate, and never bundled into a general release.

Compensation and withdrawal. Speakers and voice talent are compensated as contributors, and withdrawal rights are honored on the record.

No scraped community audio. The Factory does not harvest sermons, calls, broadcasts, or social audio from communities that never consented. The industry shortcut is not used. Ever.

Gate 4 throughout. Population Risk governs collection, storage, and release, the same gate that governs everything the firm touches.
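The scope rules in the perimeter above can be expressed as a simple check: a use is permitted only if that exact purpose was consented to and consent has not been withdrawn. The record fields and scope names below are hypothetical, chosen to illustrate that synthesis is its own scope and is never implied by any other.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class ConsentRecord:
    speaker_id: str
    consent_language: str          # language the consent was explained in
    scopes: FrozenSet[str]         # each purpose listed explicitly
    withdrawn: bool = False


def may_use(record: ConsentRecord, purpose: str) -> bool:
    """Permit a use only for an explicitly listed scope on a live consent."""
    if record.withdrawn:
        return False
    return purpose in record.scopes


# Hypothetical speaker who agreed to recognition work but not to synthesis.
asr_only = ConsentRecord(
    speaker_id="spk_001",
    consent_language="ps",
    scopes=frozenset({"asr_training", "asr_evaluation"}),
)

# Hypothetical speaker whose consent was later withdrawn.
withdrawn = ConsentRecord(
    speaker_id="spk_002",
    consent_language="fa",
    scopes=frozenset({"asr_training", "tts_synthesis"}),
    withdrawn=True,
)
```

There is no wildcard or "general release" scope in this sketch, so a synthesis request against the first record is refused by construction.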

In practice

Corpora that arrive with their composition on paper.

Every corpus ships as an instrument, documented like one. The composition is on paper before you commit — which bands, which registers, which speaker balance, which conditions — and the delivered set matches the design or says exactly where it could not. Transcription quality is itself measured and reported, because a reference with an unknown error rate is not a reference. Evaluation splits arrive versioned, with their sourcing documented, ready to stand behind a number your team will be asked to defend. And the rights file travels with the data: consent records and scope, clean enough for procurement, responsible-AI review, and the synthesis question someone will eventually ask.

The composition document. Bands, registers, speakers, conditions — the corpus datasheet, readable before purchase and auditable after.

The reference's own quality record. Transcription verification metrics reported with the set — the instrument's error rate, stated.

Versioned, contamination-conscious evaluation splits. Built for measurement; documented sourcing; releases your benchmark numbers can cite.

A rights file procurement can sign. Consent records, explicit scopes, synthesis cleared separately — no provenance surprises.
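For the reference's own quality record, one simple measurable is how closely two independent transcription passes agree, with disagreements routed to adjudication. The sketch below uses Python's standard `difflib.SequenceMatcher` on word tokens as a stand-in agreement measure; the threshold, utterance identifiers, and placeholder tokens are assumptions, and the Factory's actual verification metrics are not specified in this document.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def pass_agreement(pass_a: str, pass_b: str) -> float:
    """Token-level similarity of two transcripts, from 0.0 to 1.0."""
    tokens_a = pass_a.split()
    tokens_b = pass_b.split()
    return SequenceMatcher(None, tokens_a, tokens_b, autojunk=False).ratio()


def needs_adjudication(pairs: Dict[str, Tuple[str, str]],
                       threshold: float = 1.0) -> List[str]:
    """Utterance ids whose two passes agree less than the threshold."""
    return sorted(
        utt_id
        for utt_id, (pass_a, pass_b) in pairs.items()
        if pass_agreement(pass_a, pass_b) < threshold
    )


# Hypothetical double-pass transcripts keyed by utterance id.
pairs: Dict[str, Tuple[str, str]] = {
    "utt_1": ("w1 w2", "w1 w2"),
    "utt_2": ("w1 w2 w3 w4", "w1 w2 wX w4"),
}

to_review = needs_adjudication(pairs)
agreement_utt_2 = pass_agreement(*pairs["utt_2"])
```

With the default threshold of 1.0, any difference between the two passes sends the utterance to a third reviewer; a shipped quality record would then report the rate of such disagreements alongside the set.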

The work behind the data

Every hour in the corpus began as a person who agreed to be recorded — recruited transparently, consented in their own language, and compensated as a contributor. The instrument is built on that, or it is not trusted.

24 Afghan languages and dialect bands

0 security incidents

100% senior-led engagements

41+ Trust Center documents

Continue

Explore the AI Data Factory.

Annotation & Post-Editing: The labeling craft behind the transcripts.

Bilingual Glossary Maintenance: The lexical governance the orthography stands on.

Low-Resource Model Evaluations: Where these references become benchmarks.

The ADF Pipeline: The end-to-end production methodology.

The door

Measure against a reference you can trust.

For speech AI teams whose Afghan-language coverage deserves an instrument, not an accident. Briefings are conducted under NDA, in Washington, D.C. or virtually.

Request a confidential briefing

Senior-led and scoped from the first conversation.