Much of your institutional vocabulary has no settled Pashto term. There is only what each translator decided, alone, on deadline.
Bilingual Glossary Maintenance is the AI Data Factory’s terminology-governance function: continuously curated bilingual lexicons across the 24-language coverage map — every entry a documented decision with provenance, dialect-band variants, and version history, maintained on a curation cycle rather than archived in a spreadsheet. It is the lexical substrate every other function in this firm draws from.
Every target-language field is a bracketed slot until the firm’s linguists supply adjudicated content.
In high-resource languages, terminology is a lookup problem: the standard term exists, in a dictionary or an industry termbase, and the translator’s job is to find it. For the Afghan languages, across the institutional and technical domains this firm serves, the dictionary runs out early. The settled term for an insurance co-payment, a special-education evaluation, a phishing attempt, or a model hallucination often does not exist — or exists in three competing variants split across dialect bands and a diaspora whose usage keeps moving. Institutional vocabulary itself has shifted with Afghanistan’s institutions, and the diaspora’s usage and the country’s usage no longer move together.
So somebody decides. Under the industry default, that somebody is each translator, alone, on deadline — which is why the same English term arrives three ways across one hospital’s documents, why opposing counsel can impeach a transcript with the agency’s own earlier filings, and why a model trained on the output learns the inconsistency as if it were signal. And the spreadsheet glossary somebody built years ago does not fix this; unmaintained, it is simply the oldest of the competing decisions.
The Factory’s answer is to make terminology a governed asset: terms decided by qualified senior linguists, with the rationale recorded; variants captured by dialect band instead of collapsed; every entry versioned; and the whole of it maintained on a cycle — because a glossary is only as good as its last review.
Terms are decided, not found.
So the decisions are made by the qualified, recorded with their rationale, varied by dialect band, and governed so they hold.
A line in a spreadsheet says what to write. A governed entry says that — and who decided it, on what basis, for which dialect bands, in which register, since when, and what it replaced. The difference is everything the spreadsheet cannot answer when the auditor, the court, or your own data team asks why this term.
Nine fields, one purpose: when someone asks why this term, the answer is on file.
Consent, diagnosis, benefits, and patient-facing vocabulary.
Feeds: Section 1557 work, patient-access pathways, the interpreter cohort’s clinical prep.
Procedure, rights, and immigration vocabulary at record-grade precision.
Feeds: court language access, certified translation, expert testimony.
Benefits, forms, and civic terminology institutions put in front of households.
Feeds: resettlement operations, public-sector engagements.
Enrollment, special-education, and family-engagement vocabulary — the IEP table’s working lexicon.
Feeds: K–12 family access, the education interpreter bench.
The vocabulary of the Factory’s own work — safety taxonomies, evaluation rubrics, interface strings.
Feeds: annotation guidelines, model evaluations, localization.
Affected-population and protection terminology, trauma-aware in register.
Feeds: NGO and multilateral programs, crisis communications.
Candidates surface from live work — a post-editor’s correction, an interpreter’s prep question, an annotation guideline gap. Every engagement feeds the lexicon; nothing learned is lost.
Senior linguists decide among competing variants — or coin, where the language has not yet settled — with native-speaker validation, dialect-band variants recorded, and the rationale documented into provenance.
Versioned release across the coverage map; client-layer glossaries inherit the update; deprecated terms are marked, never silently deleted.
Scheduled domain reviews, drift monitoring as diaspora usage moves, and expansion as new domains arrive — the cycle that separates a lexicon from a spreadsheet.
Curation runs continuously, with scheduled domain reviews [cadence confirmed at publish] — and every release passes the Five-Gate Validation Protocol, like everything the firm ships.
With governed lexicons underneath the work, consistency stops depending on memory. Your documents, your interpreters’ preparation, your training data, and your model evaluations all draw from the same source — so the term the patient hears in the consult is the term on the consent form is the term in the dataset. Your own in-house terminology onboards as a client layer over the firm’s base, governed by the same cycle. And when anyone — auditor, court, or your own reviewer — asks why a term was chosen, the provenance record answers in writing, with a date.
Documents, interpretation prep, annotation, and evaluation drawing the same governed lexicon.
Your house terminology onboarded, adjudicated, and maintained as a layer over the base — yours to keep.
The decision record behind any term, available when the question is asked.
Versioned updates on a cycle — you are never working from the old spreadsheet again.
For institutions whose Afghan-language consistency currently lives in translators’ memories and a spreadsheet from another era. Briefings are conducted under NDA, in Washington, D.C. or virtually.
Request a confidential briefing