
Low-Resource Model Evaluations

The model card lists Pashto as supported. “Supported” meant the tokenizer accepts it. Whether it works, nobody ever measured.

Low-Resource Model Evaluations is the AI Data Factory’s benchmarking function: testing AI systems — language models, translation, speech, and safety — in linguistic environments where the public benchmarks are thin, translated, contaminated, or simply absent. The firm builds the instrument first, then takes the measurement: native-authored test sets, qualified human evaluation, and results reported per language and per dialect band, never as an average.

The gap this page closes
SUPPORTED: The tokenizer accepts the language.
WORKING: No measurement exists until someone builds one.

“Supported” is a claim about the tokenizer. “Working” is a measurement someone qualified had to build.

Authored · Held out · Adjudicated · Reported

Why it’s different here

For these languages, the benchmark is usually the first broken component.

In high-resource languages, evaluation is a procedure: pick the benchmark, run the model, read the score. In the Afghan languages, the procedure fails at step one, because the benchmark itself fails in four reliable ways. Coverage is thin: the major multilingual suites include at most one or two of the twenty-four Afghan languages, often none. What exists is frequently translated: test items are rendered from English, so the benchmark measures translationese rather than the language. Much of it is contaminated: scraped from the same few public websites the models trained on, which inflates every score. And the reference answers are often unvalidated: nobody qualified ever adjudicated whether the “correct” answers are correct.

Layered over all four is the reporting problem: multilingual performance is published as an average, and an average across dozens of languages hides the per-language collapse precisely where it is worst. A model can be genuinely strong in most languages, absent in Pashai, and carry one respectable number for all of it.
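
To make the reporting problem concrete, here is a minimal Python sketch. The language names and scores are invented placeholders, not measurements of any model, and `report` and its `floor` threshold are hypothetical names introduced only for illustration.

```python
from statistics import mean

# Placeholder per-language scores, invented for illustration only.
scores = {
    "lang_a": 0.90,
    "lang_b": 0.88,
    "lang_c": 0.86,
    "lang_d": 0.12,
}


def report(scores, floor=0.5):
    """Return the plain average and the languages scoring below `floor`."""
    average = mean(scores.values())
    failing = sorted(lang for lang, s in scores.items() if s < floor)
    return average, failing


average, failing = report(scores)
print(f"average: {average:.2f}")

# The per-language lines, worst first, show what the single number hides.
for lang, s in sorted(scores.items(), key=lambda kv: kv[1]):
    flag = "  <-- below floor" if lang in failing else ""
    print(f"{lang}: {s:.2f}{flag}")
```

With these placeholder values the average is 0.69, a number that looks acceptable on its own while one language sits at 0.12.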

So evaluation here is not a procedure; it is instrument-making. Before the measurement can mean anything, someone has to author the items in the language, hold them out of the training data, validate the references, and commit to reporting that refuses the average. That is this function.

Exhibit 01 · Where the benchmark breaks
Thin: One or two of the twenty-four, often none.
Translated: Rendered from English, so it measures translationese.
Contaminated: Scraped from the model’s own training pages.
Unvalidated: No qualified adjudication of the “correct” answers.

Four failure modes, usually present together. Each one alone voids the score.

24 Afghan languages, evaluable to one standard
4 system classes: language models, translation, speech, safety
100% native-authored test items; no translated test sets
0 averages reported without their per-language, per-band breakdown

The doctrine

Supported is not working.

Between the model card’s claim and the deployed reality sits a measurement someone qualified had to build. For these languages, the firm builds it.

The scope

Four system classes, one evaluation standard.

01

Language models

Generation quality, instruction-following, factuality and hallucination behavior in-language, refusal and safety behavior in-language, and cultural validity — the dimensions where a fluent model fails quietly.

02

Machine translation

Human evaluation on MQM-style error typologies — adequacy, fluency, terminology, register — with terminology conformance checked against the firm’s governed glossaries, and automatic metrics used alongside, with their low-resource calibration limits stated.

03

Speech systems

Recognition error rates and synthesis quality measured on the firm’s reference sets — per dialect band and acoustic condition, never just overall.
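
As an illustration of per-band reporting for recognition, the sketch below computes word error rate (word-level edit distance divided by reference length) separately for each dialect band. It is not the firm's tooling: the function names, the band labels, and the whitespace tokenization are all assumptions, and a real evaluation would need a text normalization and tokenization policy suited to each language and script.

```python
from collections import defaultdict


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer_by_band(samples):
    """samples: iterable of (band, reference, hypothesis) strings.

    Returns {band: total word errors / total reference words}.
    Whitespace tokenization is a simplifying assumption.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for band, ref, hyp in samples:
        ref_tokens, hyp_tokens = ref.split(), hyp.split()
        errors[band] += edit_distance(ref_tokens, hyp_tokens)
        words[band] += len(ref_tokens)
    return {band: errors[band] / words[band] for band in words if words[band]}


# Placeholder tokens and hypothetical band labels, not real transcripts.
samples = [
    ("band_a", "a b c d", "a b c d"),
    ("band_b", "a b c d", "a x c"),
]
result = wer_by_band(samples)
print(result)
```

Pooling errors and reference words within each band, rather than averaging per-utterance rates, keeps short utterances from dominating a band's figure.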

04

Safety and integrity systems

Classifier and filter performance on Afghan-language harms — the coverage question most safety stacks have never been asked in these languages.

The instrument

Before the measurement, the instrument — built to a written standard.

Every benchmark this function ships is constructed to the same six-point standard, before a single measurement is taken.

01 Native-authored. Test items are created in the language, by qualified native speakers, never translated from an English benchmark, because translated tests measure the translation.
02 Task-realistic. Items reflect what institutions actually deploy these systems to do (comprehension, form completion, consent explanation, safety refusals), not trivia the deployment will never see.
03 Contamination-conscious. Purpose-built and held out; versioned releases with a deliberate disclosure policy, so the test stays a test.
04 References, validated. Every reference answer adjudicated by senior linguists with documented rationale: the gold properly made, as everywhere in the Factory.
05 Band-stratified. Items tagged by dialect band at construction, so the reporting can show what the average would hide.
06 Documented. Every benchmark ships with its datasheet: composition, authorship qualifications, validation record, version.

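
The standard above relies on authoring and holding out test items. As a supplementary spot-check, and one the source does not describe, a team could compare a test item against any public text it has access to by measuring n-gram overlap. The sketch below is an assumption-laden illustration: `ngrams`, `overlap_ratio`, the default `n=5`, and the whitespace tokenization are all hypothetical choices.

```python
def ngrams(text, n):
    """Set of word n-grams in `text`, using whitespace tokenization."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(item, corpus_docs, n=5):
    """Fraction of the item's n-grams that also occur in any corpus document.

    Returns 0.0 when the item is shorter than n tokens.
    """
    item_grams = ngrams(item, n)
    if not item_grams:
        return 0.0
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


# Placeholder tokens standing in for a test item and a public web page.
ratio = overlap_ratio("a b c d e f", ["x a b c d y"], n=3)
print(ratio)
```

A high ratio suggests the item may already circulate in public text. A low ratio proves nothing about what a model was trained on, which is why the check supplements the hold-out policy instead of replacing it.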
The numbers, honestly

Results that survive your own skeptics.

The reporting standard is the part the buyer never sees coming — the rules this function will not break, on every evaluation it runs.

Per language, always. No multilingual soup; every language gets its own line, including the lines that embarrass the model.
Per band, where bands diverge. The firm’s standing rule, that an average is not parity, applied to every evaluation it runs.
Human evaluation, qualified and measured. Raters from the Collective, qualified under the Expert Network Standards, with inter-annotator agreement reported: agreement among the qualified, where it means something.
Automatic metrics, with their limits stated. Scores reported alongside human judgment, never instead of it; low-resource calibration caveats stated in the deliverable, not the footnote.
Errors typed, not just counted. MQM-style taxonomies tell engineering what fails (terminology, register, hallucination, refusal), not merely how much.

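
The source does not say which agreement statistic is reported. One common choice for two raters assigning categorical labels is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a minimal illustration under that assumption; the label names and ratings are invented.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's label frequencies, summed.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is formally
        # undefined here, and returning 1.0 is a convention of this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented error-type labels from two hypothetical raters on six items.
rater_a = ["terminology", "register", "terminology", "none", "none", "register"]
rater_b = ["terminology", "register", "none", "none", "none", "terminology"]
kappa = cohens_kappa(rater_a, rater_b)
print(f"kappa: {kappa:.2f}")
```

Here the raters agree on four of six items (0.67 raw agreement), but chance alone would give 0.33, so kappa comes out at 0.50. Evaluations with more than two raters typically call for a different statistic, such as Fleiss' kappa or Krippendorff's alpha.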
In practice

A measurement your team can defend in the room.

Engagements start from your deployment, not from a shelf benchmark: the languages and bands your population actually speaks, the tasks your system will actually perform, turned into an instrument built to the construction standard. The evaluation then produces numbers with their evidence attached — items, references, rater qualifications, agreement metrics — so when your leadership, your customer, or your regulator asks how you know the system works in Dari, the answer is a document, not a model card. And because the benchmark is versioned, it persists: the next model version is measured against the same instrument, and progress becomes a fact instead of a feeling.

An instrument built for your question. Your languages, your bands, your tasks — not the nearest available benchmark.
Numbers with their evidence. Items, validated references, rater qualifications, and agreement metrics — auditable end to end.
Failures typed, not just counted. The error taxonomy hands engineering a worklist, not a grade.
A baseline that persists. Versioned benchmarks; every future model version measured against the same ruler.
Qualified human judgment

Every measurement on this page rests on a human judgment — made in the language, by someone qualified, and put on the record.

Reference answers are adjudicated by senior linguists; raters are drawn from the Collective and qualified under the Expert Network Standards — the gold properly made, before any score is read.

24 Afghan languages and dialect bands
0 security incidents
100% senior-led engagements
41+ Trust Center documents

Request a briefing

Find out whether supported means working.

For AI teams whose Afghan-language claims will eventually meet a user, a customer, or a regulator. Briefings are conducted under NDA, in Washington, D.C. or virtually.

Senior-led · under NDA · Washington, D.C. or virtual.