Operating Model · AI Data Factory

Annotation & Post-Editing

Your vendor reports high annotator agreement. Three strangers agreeing in a language none of them has mastered is not accuracy; it is shared error, confidently measured.

Annotation & Post-Editing is the AI Data Factory's human-in-the-loop function: training and validation data for multilingual models, labeled and corrected in-language by qualified native speakers across all 24 Afghan languages and their dialect bands — under versioned guidelines, senior adjudication, and the Five-Gate Validation Protocol™, with the quality metrics shipped alongside every delivery.

Every label, three acts
  • 01 Label.
  • 02 Adjudicate.
  • 03 Document.

A judgment made in the language, resolved by a senior, and shipped with its record.

Why judgment

The industry's annotation machinery assumes a crowd that does not exist.

Annotation & Post-Editing, defined

Annotation & Post-Editing is the AI Data Factory's human-in-the-loop function. It produces training and validation data for multilingual models, labeled and corrected in-language by qualified native speakers across all 24 Afghan languages and their dialect bands. The work runs under versioned guidelines, calibration, measured inter-annotator agreement with senior adjudication, firm-authored gold sets, and the Five-Gate Validation Protocol, with quality metrics shipped alongside every delivery. It exists because the industry's crowd-plus-majority-vote machinery assumes a qualified crowd that, for these languages, does not exist. Agreement is not accuracy.

Commodity annotation runs on three parts: an anonymous crowd, a per-label price, and majority vote as quality control. For high-resource languages it limps along. For Afghan languages it collapses at the first part — there is no platform crowd of qualified Wakhi annotators, no pool of Pashai raters three-deep on every item — and everything downstream collapses with it. Majority vote among the unqualified does not produce truth; it produces consensus error. Gold questions cannot rescue it, because someone qualified has to make the gold. And the model trained on the output inherits all of it, permanently: the data sets the ceiling.
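The gap between agreement and accuracy can be made concrete with a small sketch. Everything in it is hypothetical and invented for illustration: the items, the label names, and the helper functions are assumptions, not the firm's data or tooling.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most common label and the share of annotators who chose it."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)

# Hypothetical items: (label a qualified speaker would assign,
#                      labels from three unqualified annotators).
ITEMS = [
    ("sarcastic", ["positive", "positive", "positive"]),
    ("neutral",   ["neutral", "neutral", "neutral"]),
    ("negative",  ["positive", "positive", "negative"]),
    ("sarcastic", ["positive", "positive", "positive"]),
]

def agreement_and_accuracy(items):
    """Mean share of annotators siding with the majority, and majority-vote accuracy."""
    shares = []
    correct = 0
    for truth, labels in items:
        winner, share = majority_vote(labels)
        shares.append(share)
        correct += winner == truth
    return sum(shares) / len(shares), correct / len(items)

mean_agreement, accuracy = agreement_and_accuracy(ITEMS)
print(f"agreement with majority: {mean_agreement:.2f}, accuracy: {accuracy:.2f}")
```

On this toy set the annotators side with the majority about 92% of the time, while the majority label is correct on one item in four. The agreement figure says nothing about which of those two numbers a model will inherit.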

The honest conclusion is that every label is a judgment — a linguistic and cultural decision someone was qualified to make, or was not. For the Afghan languages, the people qualified to make those judgments exist almost entirely in the diaspora, and they are not anonymous, not interchangeable, and not available at a penny a label. So the firm built the function around them instead of around the crowd: known, qualified native speakers, working under written guidelines, with disagreement treated as signal for senior adjudication rather than noise to be averaged away.

Exhibit 01: Agreement is not accuracy. (Figure labels: "three annotators agree"; "qualified judgment, on target".)

Three annotators agree — the tight cluster — and land far from accurate. One qualified judgment lands on target. Consensus measures only itself.

24 Afghan languages, annotated in-language with their dialect bands
100% Annotators qualified under the Expert Network Standards
0 Anonymous crowdwork: every annotator is known, trained, and accountable
[N] Annotated and post-edited items delivered

The doctrine

Agreement is not accuracy.

Consensus among the unqualified measures only itself. Accuracy is judgment, qualified — then measured, adjudicated, and documented.

The work

Labels for training and validation, made in the language.

One principle governs everything below: annotation happens in the language — never the industry shortcut of translating to English and labeling the translation, which destroys exactly the dialect, register, and cultural signal the label exists to capture.

Classification & labeling.
Intent, topic, sentiment, and safety or policy labels on Afghan-language text — judged in the register and dialect band the text was written in.
Translation-quality annotation.
Segment-level quality labels and MQM-style error typing — accuracy, fluency, terminology, register, and cultural-validity errors, each categorized, not just flagged.
RLHF & preference data.
Rankings, ratings, and rubric-based preference judgments over Afghan-language model outputs; instruction-following evaluation by people who can tell compliant from fluent.
Span & structure annotation.
Named entities, relations, and linking — with Afghan naming conventions and transliteration variants handled as the first-class problem they are.
Speech annotation.
Transcription, dialect-band tagging, and speaker attributes — the labeled substrate behind the firm's ASR/TTS reference work.
Safety & cultural-validity labeling.
The labels behind the firm's hallucination and trust-and-safety audit work: harm, appropriateness, and cultural-validity judgments that require exactly the qualification a crowd cannot supply.

The second craft

Correction at scale — and correction as signal.

Machine-translation post-editing.
Light and full post-editing to written specification, in the dialect band the audience actually speaks — throughput with governance, not instead of it.
LLM-output post-editing.
Model-generated Afghan-language content corrected to deliverable standard — the human pass that makes generation usable in institutions.
Error-annotated corpora.
Every correction can be typed against the error taxonomy as it is made — so post-editing does not just fix the output; it produces the training and evaluation data that fixes the model.
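One way to picture an error-annotated correction is as a record that keeps the machine output, the human correction, and the error types together. The sketch below is an assumption, not the firm's schema: the field names and the `error_profile` helper are hypothetical, and the five error types are simply the categories named under translation-quality annotation above. The text fields hold placeholders rather than real segments.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum

class ErrorType(Enum):
    # Categories taken from the translation-quality annotation description.
    ACCURACY = "accuracy"
    FLUENCY = "fluency"
    TERMINOLOGY = "terminology"
    REGISTER = "register"
    CULTURAL_VALIDITY = "cultural_validity"

@dataclass(frozen=True)
class PostEdit:
    segment_id: str
    dialect_band: str          # hypothetical field: band the audience speaks
    machine_output: str
    corrected: str
    error_types: tuple = ()    # empty tuple means accepted without typed errors

def error_profile(edits):
    """Count how often each error type was corrected across a batch of edits."""
    counts = Counter(e for edit in edits for e in edit.error_types)
    return {etype.value: counts.get(etype, 0) for etype in ErrorType}

edits = [
    PostEdit("seg-001", "band-a", "<machine output>", "<corrected text>",
             (ErrorType.REGISTER, ErrorType.TERMINOLOGY)),
    PostEdit("seg-002", "band-a", "<machine output>", "<corrected text>",
             (ErrorType.REGISTER,)),
    PostEdit("seg-003", "band-b", "<machine output>", "<machine output>"),
]
profile = error_profile(edits)
print(profile)
```

Kept in this shape, the same batch serves two purposes: the corrected text is the deliverable, and the typed pairs are evaluation or training data showing where the model fails.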

Most vendors throw the edits away. Here, the edits are the asset.

The workforce

The Factory runs on the Collective.

Who they are
Native speakers, qualified. Annotators are members of the Human Intelligence Collective, qualified under the Expert Network Standards: known, verified, and accountable, never anonymous.
Matched by dialect band. Assignments respect the Language Stack's second axis; a label made in the wrong band is a quiet error no aggregation will catch.
Trained per project. Calibrated on the project's versioned guidelines before production begins, and recalibrated when the guidelines move.
On what terms
Fair professional engagement. Annotators are engaged and compensated as the skilled professionals they are. The Collective's never-extractive principle applies to data work explicitly, because data work is where the industry violates it most.
No penny-per-label model. The anonymous microwork economy is structurally incompatible with quality in these languages, and with the firm's ethics. It is not used. Ever.
Accountability with protection. Annotators are known to the firm and accountable for their judgments, and covered by the same identity-protection and right-of-refusal practices as every member of the Collective.

The machinery

Quality is a system, not a vote.

  • 01 Versioned guidelines. Annotation guidelines are written, versioned artifacts of the Lapis Stack: auditable by the client, improvable by the work.
  • 02 Calibration before production. No annotator labels live data until calibration rounds demonstrate guideline alignment.
  • 03 Agreement, interpreted. Inter-annotator agreement is measured on every project, among qualified annotators, where it means something, and disagreement is routed to adjudication as signal, never averaged away as noise.
  • 04 Senior adjudication. Disagreements are resolved by senior linguists with documented rationale; the rationale flows back into the guidelines, so the system learns.
  • 05 Gold and audit sets, properly made. Reference items authored and maintained by the firm's own senior linguists, because someone qualified has to make the gold.
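The calibration gate and the routing of disagreement described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names, the record layout, and the 0.9 calibration threshold are all hypothetical, since the page does not publish the actual threshold or tooling.

```python
def passes_calibration(annotator_labels, reference_labels, threshold=0.9):
    """True when the share of calibration items matching the reference meets the threshold.

    The 0.9 default is a placeholder. Items the annotator did not label count as misses.
    """
    matches = sum(a == r for a, r in zip(annotator_labels, reference_labels))
    return matches / len(reference_labels) >= threshold

def route(item_id, labels):
    """Pass unanimous items through; send any disagreement to adjudication."""
    distinct = sorted(set(labels))
    if len(distinct) == 1:
        return {"item": item_id, "status": "agreed", "label": distinct[0]}
    return {"item": item_id, "status": "needs_adjudication", "candidates": distinct}

def adjudicate(routed, label, rationale):
    """Record a senior decision; a written rationale is mandatory."""
    if routed["status"] != "needs_adjudication":
        raise ValueError("only disputed items are adjudicated")
    if not rationale.strip():
        raise ValueError("adjudication requires a written rationale")
    return {"item": routed["item"], "status": "adjudicated",
            "label": label, "rationale": rationale}

disputed = route("item-17", ["harm", "safe", "harm"])
resolved = adjudicate(disputed, "harm", "Idiom reads as a threat in this register.")
print(resolved["status"], resolved["label"])
```

Note what the sketch does not do: it never takes the two-to-one split as the answer. The split only decides where the item goes next, and the rationale string is what would feed a later guideline revision.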

Every delivery ships with its quality record: inter-annotator agreement, adjudication rate, gold-set accuracy, and the guideline version it was produced under — then passes the Five-Gate Validation Protocol™ like everything the firm ships.
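As a rough sketch of what such a quality record could contain, the example below computes the four named fields for a two-annotator batch. It rests on assumptions: Cohen's kappa is used as the agreement statistic only because it is a common choice for two annotators (the page does not say which statistic is reported), and the function names, field names, labels, and version string are hypothetical.

```python
from collections import Counter

def cohen_kappa(a, b):
    """Chance-corrected agreement between two annotators over the same items."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[l] * count_b[l] for l in set(a) | set(b)) / (n * n)
    if expected == 1:          # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)

def quality_record(ann_a, ann_b, final, gold, guideline_version):
    """Assemble the per-delivery record. `gold` maps item index to reference label."""
    disagreements = sum(x != y for x, y in zip(ann_a, ann_b))
    gold_hits = sum(final[i] == label for i, label in gold.items())
    return {
        "inter_annotator_kappa": cohen_kappa(ann_a, ann_b),
        "adjudication_rate": disagreements / len(ann_a),
        "gold_set_accuracy": gold_hits / len(gold),
        "guideline_version": guideline_version,
    }

record = quality_record(
    ann_a=["x", "x", "y", "y"],
    ann_b=["x", "y", "y", "y"],
    final=["x", "y", "y", "y"],     # labels after adjudication
    gold={0: "x", 3: "y"},          # items seeded from the gold set
    guideline_version="v1.2",       # placeholder version string
)
print(record)
```

For this toy batch the record reports a kappa of 0.5, an adjudication rate of 0.25, and gold-set accuracy of 1.0. With more than two annotators per item, a multi-rater statistic such as Fleiss' kappa or Krippendorff's alpha would take the place of `cohen_kappa`.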

In practice

Data that arrives with its evidence attached.

Engagements begin with the guideline, not the queue: your taxonomy and edge cases turned into a versioned document both sides can hold, then calibrated against before a single production label exists. From there, the function runs like infrastructure — throughput you can plan on, in languages you cannot staff — and every delivery lands with its quality record attached, so your data team audits numbers instead of trusting adjectives. When the work surfaces something your taxonomy did not anticipate, that is signal too; it comes back to you as a documented guideline question, not a silent guess.

The metrics, attached. Agreement, adjudication, and gold accuracy shipped with every delivery — auditable, not asserted.

Guidelines you can read. The versioned instructions your data was produced under, available to your team.

Known annotators. Qualified, accountable people — with the qualification regime published.

Ethically produced data. A supply chain your procurement and responsible-AI reviews can sign — no anonymous microwork, anywhere in it.

24 Afghan languages & dialect bands
0 Security incidents
100% Senior-led engagements
41+ Trust Center documents

Request a briefing

Stop averaging error. Start measuring judgment.

For AI labs, model developers, and data teams whose Afghan-language coverage deserves better than the crowd. Briefings are conducted under NDA, in Washington, D.C. or virtually.

Request a confidential briefing

Every inquiry is received as a confidential institutional matter.