Low-Resource Model Evaluations
The model card lists Pashto as supported. “Supported” means the tokenizer accepts it. Whether it works, nobody has ever measured.
Low-Resource Model Evaluations is the AI Data Factory’s benchmarking function: testing AI systems — language models, translation, speech, and safety — in linguistic environments where the public benchmarks are thin, translated, contaminated, or simply absent. The firm builds the instrument first, then takes the measurement: native-authored test sets, qualified human evaluation, and results reported per language and per dialect band, never as an average.
“Supported” is a claim about the tokenizer. “Working” is a measurement someone qualified had to build.
For these languages, the benchmark is usually the first broken component.
In high-resource languages, evaluation is a procedure: pick the benchmark, run the model, read the score. In the Afghan languages, the procedure fails at step one, because the benchmark itself fails in four reliable ways. Coverage is thin — the major multilingual suites include at most one or two of the twenty-four, often none. What exists is frequently translated — test items rendered from English, so the benchmark measures translationese rather than the language. Much of it is contaminated — scraped from the same few public websites the models trained on, which inflates every score. And the reference answers are often unvalidated — nobody qualified ever adjudicated whether the “correct” answers are correct.
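As a rough illustration of the contamination failure, the sketch below flags test items that share a word n-gram with a sample of crawled text. It is a minimal sketch, not the firm's method: the function names, the 5-gram window, and whitespace tokenization are all assumptions, and whitespace splitting is a crude stand-in for proper tokenization in these languages.

```python
from __future__ import annotations


def ngrams(text: str, n: int = 5) -> set[tuple[str, ...]]:
    """Return the set of word n-grams in text (whitespace tokenization, an assumption)."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(test_items: list[str], corpus_docs: list[str], n: int = 5) -> float:
    """Fraction of test items sharing at least one n-gram with the corpus sample."""
    if not test_items:
        return 0.0
    corpus_grams: set[tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    # An item is flagged if any of its n-grams also appears in the corpus sample.
    flagged = sum(1 for item in test_items if ngrams(item, n) & corpus_grams)
    return flagged / len(test_items)
```

A non-zero rate does not prove a model trained on the item, and a zero rate against a small sample does not prove it did not; the check only indicates which items deserve a closer look before any score is trusted.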
Layered over all four is the reporting problem: multilingual performance is published as an average, and an average across dozens of languages hides the per-language collapse precisely where it is worst. A model can be genuinely strong in most languages, absent in Pashai, and carry one respectable number for all of it.
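The arithmetic behind the averaging problem can be shown in a few lines. The sketch below is illustrative only: the scores are invented placeholders, not measured results, and the `per_language_report` name and the `floor` threshold are assumptions introduced for the example.

```python
from statistics import mean


def per_language_report(scores: dict, floor: float) -> dict:
    """Report the average next to the per-language scores it would otherwise hide."""
    return {
        "average": mean(scores.values()),
        # Worst language first, so a collapse is the first thing a reader sees.
        "per_language": dict(sorted(scores.items(), key=lambda kv: kv[1])),
        "below_floor": sorted(lang for lang, s in scores.items() if s < floor),
    }


# Hypothetical scores on a 0-1 scale, invented for illustration only.
scores = {"English": 0.90, "Dari": 0.85, "Pashto": 0.80, "Pashai": 0.05}
report = per_language_report(scores, floor=0.5)
print(report["average"])      # one respectable-looking number
print(report["below_floor"])  # the language that number conceals
```

With these placeholder values the average lands near 0.65 while one language sits at 0.05, which is the pattern the paragraph above describes.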
So evaluation here is not a procedure; it is instrument-making. Before the measurement can mean anything, someone has to author the items in the language, hold them out of the training data, validate the references, and commit to reporting that refuses the average. That is this function.
Four failure modes, usually present together. Each one alone voids the score.
Supported is not working.
Between the model card’s claim and the deployed reality sits a measurement someone qualified had to build. For these languages, the firm builds it.
Four system classes, one evaluation standard.
Language models
Generation quality, instruction-following, factuality and hallucination behavior in-language, refusal and safety behavior in-language, and cultural validity — the dimensions where a fluent model fails quietly.
Machine translation
Human evaluation on MQM-style error typologies — adequacy, fluency, terminology, register — with terminology conformance checked against the firm’s governed glossaries, and automatic metrics used alongside, with their low-resource calibration limits stated.
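To make the error-typology idea concrete, here is a minimal sketch of how annotated errors might be rolled up into a weighted penalty. The severity weights, the per-100-words normalization, and the function name are assumptions chosen for illustration; the source does not state which weights or formula the firm uses.

```python
from collections import Counter

# Assumed severity weights for illustration; real MQM deployments choose their own.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}


def mqm_summary(errors, word_count):
    """Summarize annotated errors as a weighted penalty per 100 source words.

    errors: iterable of (category, severity) pairs, e.g. ("terminology", "major").
    """
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    by_category = Counter()
    for category, severity in errors:
        by_category[category] += SEVERITY_WEIGHTS[severity]
    penalty = sum(by_category.values())
    return {
        "penalty_per_100_words": 100 * penalty / word_count,
        "by_category": dict(by_category),
    }
```

Keeping the per-category breakdown next to the total lets a reader see whether a penalty comes from terminology, register, or adequacy rather than reading one blended number.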
Speech systems
Recognition error rates and synthesis quality measured on the firm’s reference sets — per dialect band and acoustic condition, never just overall.
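A sketch of what "per dialect band and acoustic condition" could look like for recognition: word error rate computed separately for each (band, condition) pair. The sample field names (`band`, `condition`, `reference`, `hypothesis`) are hypothetical, and whitespace word splitting is an assumption.

```python
from collections import defaultdict


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer_by_group(samples):
    """Word error rate per (band, condition) pair; no overall figure is returned."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for s in samples:
        key = (s["band"], s["condition"])
        ref = s["reference"].split()
        hyp = s["hypothesis"].split()
        errors[key] += edit_distance(ref, hyp)
        words[key] += len(ref)
    # Groups with no reference words are skipped to avoid dividing by zero.
    return {k: errors[k] / words[k] for k in words if words[k] > 0}
```

Errors and reference words are pooled within each group before dividing, so long and short utterances are weighted by length rather than averaged one utterance at a time.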
Safety and integrity systems
Classifier and filter performance on Afghan-language harms — the coverage question most safety stacks have never been asked in these languages.
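One way to pose the coverage question is recall on known-harmful items, split by language. The sketch below assumes a simple record layout of (language, is_harmful, was_flagged); that layout and the function name are hypothetical.

```python
from collections import defaultdict


def recall_by_language(records):
    """Share of harmful items the filter flagged, computed per language.

    records: iterable of (language, is_harmful, was_flagged) tuples.
    """
    hits = defaultdict(int)
    positives = defaultdict(int)
    for language, is_harmful, was_flagged in records:
        if is_harmful:
            positives[language] += 1
            hits[language] += int(was_flagged)
    # Languages with no harmful items in the set are omitted, not reported as zero.
    return {lang: hits[lang] / positives[lang] for lang in positives}
```

Recall alone ignores false positives, so in practice it would sit beside precision for the same language.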
Before the measurement, the instrument — built to a written standard.
Every benchmark this function ships is constructed to the same six-point standard, before a single measurement is taken.
Results that survive your own skeptics.
The reporting standard is the part the buyer never sees coming — the rules this function will not break, on every evaluation it runs.
A measurement your team can defend in the room.
Engagements start from your deployment, not from a shelf benchmark: the languages and bands your population actually speaks, the tasks your system will actually perform, turned into an instrument built to the construction standard. The evaluation then produces numbers with their evidence attached — items, references, rater qualifications, agreement metrics — so when your leadership, your customer, or your regulator asks how you know the system works in Dari, the answer is a document, not a model card. And because the benchmark is versioned, it persists: the next model version is measured against the same instrument, and progress becomes a fact instead of a feeling.
Every measurement on this page rests on a human judgment — made in the language, by someone qualified, and put on the record.
Reference answers are adjudicated by senior linguists; raters are drawn from the Collective and qualified under the Expert Network Standards. The gold standard is properly made before any score is read.
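Putting a human judgment on the record usually includes showing how far independent raters agree. As one common example, the sketch below computes Cohen's kappa for two raters labelling the same items. The source does not say which agreement statistic the firm reports, so the choice of kappa here is an assumption.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same items in the same order."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance, from each rater's own label frequencies.
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

Kappa corrects raw agreement for chance, so two raters who both mark nearly everything "acceptable" score lower than their raw agreement rate would suggest.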
Explore the AI Data Factory.
Find out whether supported means working.
For AI teams whose Afghan-language claims will eventually meet a user, a customer, or a regulator. Briefings are conducted under NDA, in Washington, D.C. or virtually.
Request a confidential briefing
Senior-led · under NDA · Washington, D.C. or virtual.