Operating Model · AI Data Factory

Bilingual Glossary Maintenance

Much of your institutional vocabulary has no settled Pashto term. There is only what each translator decided, alone, on deadline.

Bilingual Glossary Maintenance is the AI Data Factory’s terminology-governance function: continuously curated bilingual lexicons across the 24-language coverage map. Every entry is a documented decision with provenance, dialect-band variants, and version history, maintained on a curation cycle rather than archived in a spreadsheet. It is the lexical substrate every other function in this firm draws from.

Specimen Entry

Source: informed consent

Target: [term — Dari · Kabuli]

Provenance: senior linguist panel · [date]

Version: [v · release]

Every target-language field is a bracketed slot until the firm’s linguists supply adjudicated content.

Why Governance

When the dictionary runs out, somebody decides — the question is who, and whether it holds.

In high-resource languages, terminology is a lookup problem: the standard term exists, in a dictionary or an industry termbase, and the translator’s job is to find it. For the Afghan languages, across the institutional and technical domains this firm serves, the dictionary runs out early. The settled term for an insurance co-payment, a special-education evaluation, a phishing attempt, or a model hallucination often does not exist — or exists in three competing variants split across dialect bands and a diaspora whose usage keeps moving. Institutional vocabulary itself has shifted with Afghanistan’s institutions, and the diaspora’s usage and the country’s usage no longer move together.

So somebody decides. Under the industry default, that somebody is each translator, alone, on deadline — which is why the same English term arrives three ways across one hospital’s documents, why opposing counsel can impeach a transcript with the agency’s own earlier filings, and why a model trained on the output learns the inconsistency as if it were signal. And the spreadsheet glossary somebody built years ago does not fix this; unmaintained, it is simply the oldest of the competing decisions.

The Factory’s answer is to make terminology a governed asset: terms decided by qualified senior linguists, with the rationale recorded; variants captured by dialect band instead of collapsed; every entry versioned; and the whole of it maintained on a cycle — because a glossary is only as good as its last review.

24 languages on the coverage map, paired with English

[N] terms under governance

100% of entries carry provenance: the decision, its basis, and its version

6 core domains under continuous curation

The Anatomy

An entry is a decision record, not a word pair.

A line in a spreadsheet says what to write. A governed entry says that — and who decided it, on what basis, for which dialect bands, in which register, since when, and what it replaced. The difference is everything the spreadsheet cannot answer when the auditor, the court, or your own data team asks why this term.

Specimen Entry

Illustrative: target-language fields bracketed pending linguist adjudication

Source term

informed consent (en)

Target term

[term — Dari · band: Kabuli]

Script & transliteration

[Perso-Arabic form] · [romanization variant(s)]

Dialect-band variants

[variant — band] · [variant — band] (recorded, not collapsed)

Domain & register

Medical & health · formal, patient-facing

Usage guidance

[when this term, not its competitors; example in context]

Provenance

decided by senior linguist panel · [date] · basis: corpus evidence, native-speaker adjudication

Status

Approved (supersedes [deprecated variant], v1.x)

Version

[v · release date]

Nine fields, one purpose: when someone asks why this term, the answer is on file.
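
For data teams, the nine fields above could be modeled as a typed record. What follows is a minimal sketch only: the class names, field names, and the why_this_term helper are hypothetical and are not the firm's actual schema, and every target-language value stays a bracketed placeholder.

```python
from __future__ import annotations

from dataclasses import dataclass, fields
from enum import Enum


class Status(Enum):
    APPROVED = "approved"
    DEPRECATED = "deprecated"   # marked, never silently deleted


@dataclass(frozen=True)
class Provenance:
    decided_by: str             # e.g. "senior linguist panel"
    date: str                   # kept as a string so a placeholder fits
    basis: tuple[str, ...]      # evidence the decision rests on


@dataclass(frozen=True)
class GlossaryEntry:
    """Hypothetical record mirroring the nine fields of the specimen entry."""
    source_term: str
    target_term: str
    script_and_transliteration: str
    dialect_band_variants: dict[str, str]   # band -> variant, not collapsed
    domain_and_register: str
    usage_guidance: str
    provenance: Provenance
    status: Status
    version: str

    def why_this_term(self) -> str:
        """Answer 'why this term' from the record itself."""
        p = self.provenance
        return (
            f"{self.source_term} -> {self.target_term} ({self.version}): "
            f"decided by {p.decided_by} on {p.date}; "
            f"basis: {', '.join(p.basis)}"
        )


FIELD_NAMES = [f.name for f in fields(GlossaryEntry)]

# Illustrative instance; bracketed values are unfilled slots, not real terms.
specimen = GlossaryEntry(
    source_term="informed consent",
    target_term="[term]",
    script_and_transliteration="[Perso-Arabic form] / [romanization]",
    dialect_band_variants={"[band A]": "[variant]", "[band B]": "[variant]"},
    domain_and_register="Medical & health, formal, patient-facing",
    usage_guidance="[when this term, not its competitors]",
    provenance=Provenance(
        decided_by="senior linguist panel",
        date="[date]",
        basis=("corpus evidence", "native-speaker adjudication"),
    ),
    status=Status.APPROVED,
    version="[v]",
)

print(len(FIELD_NAMES))
print(specimen.why_this_term())
```

Making the record immutable is a design choice in this sketch: a changed decision becomes a new version rather than an edit to the old one.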

The Coverage

Curated by domain, because register is half the term.

Medical & health

Consent, diagnosis, benefits, and patient-facing vocabulary.

Feeds: Section 1557 work, patient-access pathways, the interpreter cohort’s clinical prep.

Legal & court

Procedure, rights, and immigration vocabulary at record-grade precision.

Feeds: court language access, certified translation, expert testimony.

Government & public services

Benefits, forms, and civic terminology institutions put in front of households.

Feeds: resettlement operations, public-sector engagements.

Education

Enrollment, special-education, and family-engagement vocabulary — the IEP table’s working lexicon.

Feeds: K–12 family access, the education interpreter bench.

AI & technical

The vocabulary of the Factory’s own work — safety taxonomies, evaluation rubrics, interface strings.

Feeds: annotation guidelines, model evaluations, localization.

Humanitarian & protection

Affected-population and protection terminology, trauma-aware in register.

Feeds: NGO and multilateral programs, crisis communications.

The Cycle

A glossary is only as good as its last review.

Intake

Candidates surface from live work — a post-editor’s correction, an interpreter’s prep question, an annotation guideline gap. Every engagement feeds the lexicon; nothing learned is lost.

Adjudication

Senior linguists decide among competing variants, or coin a term where the language has not yet settled, with native-speaker validation. Dialect-band variants are recorded, and the rationale is documented in the provenance record.

Publication

Versioned release across the coverage map; client-layer glossaries inherit the update; deprecated terms are marked, never silently deleted.

Maintenance

Scheduled domain reviews, drift monitoring as diaspora usage moves, and expansion as new domains arrive — the cycle that separates a lexicon from a spreadsheet.

Curation runs continuously, with scheduled domain reviews [cadence confirmed at publish] — and every release passes the Five-Gate Validation Protocol, like everything the firm ships.
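
The publication rule described above (versioned releases, with deprecated terms marked rather than deleted) can be illustrated with a small sketch. It rests on assumptions: the Release and Lexicon classes, their method names, and the version strings are hypothetical, and the target terms are placeholders.

```python
from __future__ import annotations

from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Release:
    source_term: str
    target_term: str
    version: str
    status: str = "approved"            # "approved" or "deprecated"
    supersedes: Optional[str] = None    # version this release replaced


class Lexicon:
    """Hypothetical append-only store: history is kept, never overwritten."""

    def __init__(self) -> None:
        self._history: dict[str, list[Release]] = {}

    def publish(self, source_term: str, target_term: str, version: str) -> Release:
        history = self._history.setdefault(source_term, [])
        supersedes = None
        if history:
            previous = history[-1]
            # Mark the earlier decision as deprecated instead of deleting it.
            history[-1] = replace(previous, status="deprecated")
            supersedes = previous.version
        release = Release(source_term, target_term, version, supersedes=supersedes)
        history.append(release)
        return release

    def current(self, source_term: str) -> Optional[Release]:
        history = self._history.get(source_term, [])
        return history[-1] if history else None

    def history(self, source_term: str) -> list[Release]:
        return list(self._history.get(source_term, []))


lexicon = Lexicon()
lexicon.publish("informed consent", "[variant A]", "v1.0")
lexicon.publish("informed consent", "[variant B]", "v1.1")

for r in lexicon.history("informed consent"):
    print(r.version, r.status, r.supersedes)
```

Because the earlier release stays in the history, a question about what a term replaced can still be answered after the term itself has changed.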

In Practice

One terminology, everywhere your language goes.

With governed lexicons underneath the work, consistency stops depending on memory. Your documents, your interpreters’ preparation, your training data, and your model evaluations all draw from the same source — so the term the patient hears in the consult is the term on the consent form is the term in the dataset. Your own in-house terminology onboards as a client layer over the firm’s base, governed by the same cycle. And when anyone — auditor, court, or your own reviewer — asks why a term was chosen, the provenance record answers in writing, with a date.

One source of truth

Documents, interpretation prep, annotation, and evaluation drawing on the same governed lexicon.

A client layer of your own

Your house terminology onboarded, adjudicated, and maintained as a layer over the base — yours to keep.

Provenance on demand

The decision record behind any term, available when the question is asked.

Living releases

Versioned updates on a cycle — you are never working from the old spreadsheet again.
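
One way to picture a client layer over the firm's base is a layered lookup in which the client's house term wins where one exists and the base answers otherwise. The sketch below uses Python's standard collections.ChainMap for this. The dictionaries, the lookup helper, and every bracketed value are illustrative assumptions, not the firm's implementation.

```python
from collections import ChainMap

# Base lexicon maintained by the firm (placeholder values only).
base = {
    "informed consent": "[base term]",
    "co-payment": "[base term]",
}

# Client layer: house terminology that overrides the base where present.
client_layer = {
    "informed consent": "[client house term]",
}

# ChainMap searches its mappings in order: client layer first, then base.
lexicon = ChainMap(client_layer, base)


def lookup(term: str) -> str:
    """Return the governing term for this client (hypothetical helper)."""
    return lexicon[term]


print(lookup("informed consent"))   # resolved from the client layer
print(lookup("co-payment"))         # falls through to the base

# A later base release is visible through the layer without copying.
base["phishing attempt"] = "[base term, later release]"
print(lookup("phishing attempt"))
```

ChainMap holds references to the underlying mappings rather than copies, which is how the client view picks up a base update in this sketch.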