OPERATING MODEL · AI DATA FACTORY

The ADF Pipeline

The deadline will be forgotten in a quarter. The dataset will remember every shortcut taken to meet it — permanently.

The ADF Pipeline is the AI Data Factory's end-to-end production methodology: six stations from specification to release, one loop that feeds every release back into the system, and the Five-Gate Validation Protocol™ positioned along the line. Every dataset, corpus, and benchmark the Factory ships is produced on it — annotation, glossaries, speech references, and evaluations alike — with no side doors.

Exhibit 01 — The line every dataset runs
The loop: every release returns · I Specify → II Source → III Calibrate → IV Produce → V Validate → VI Release
Why a pipeline

Data production in this industry is project-shaped chaos. The chaos is permanent.

Definition

The ADF Pipeline is the AI Data Factory's end-to-end production methodology: six stations — Specify, Source, Calibrate, Produce, Validate, Release — closed by a loop that returns every release's learnings to the system, with the Five-Gate Validation Protocol positioned along the line.

Most data work is run as a series of one-offs: each dataset its own improvised project, quality re-invented per engagement, provenance reconstructed afterward if anyone asks. The output looks fine (data usually does), and the problem surfaces later, downstream, where it can no longer be fixed. The reason is that data has a property deliverables in other industries do not: its defects are inherited. The unqualified label trains the model. The scraped audio sits in the corpus. The translated test item grades every future version. Quality cannot be inspected into a dataset after the fact, any more than freshness can be inspected into a meal; it is either produced, or it is absent forever.

The Factory's answer is a single, named, versioned method — the ADF Pipeline — on which everything it ships is produced. The pipeline buys three things one-off projects structurally cannot. Reproducibility: the same six stations run the same way for a thousand labels or a hundred-hour corpus. Auditability: provenance is recorded as the work happens, not reconstructed when the diligence email arrives. And compounding: a loop returns every release's learnings to the system, so the method is better at the end of each engagement than it was at the start.

6 · Stations, one method, end to end
24 · Languages, produced on one pipeline
100% · Of Factory output produced on it, no side doors
5 · Validation gates positioned along the line
The Doctrine

A dataset remembers how it was made.

Every shortcut survives in the asset forever. The pipeline exists so that what the data remembers is a method.

The method

Six stations. Every dataset. No exceptions.

I

Specify

Principle

The spec is the contract everything downstream is built against.

  • The engagement's data need becomes a written specification: taxonomy or guideline, corpus design, coverage targets by language and dialect band, QA thresholds, and the rights plan.
  • The standards baseline is set here — the sector and client requirements the deliverable must conform to, encoded before work begins.
II

Source

Principle

Material and people, acquired right — provenance from the first minute.

  • Collection runs through the Collective under the speaker's perimeter: consent in the contributor's language, explicit scopes, compensation, no scraped community material.
  • The workforce is assembled from qualified members of the Collective, matched by language and dialect band; client-supplied material goes through intake with its provenance recorded.
III

Calibrate

Principle

No one labels live data until calibration passes.

  • Gold and reference items are authored by the firm's senior linguists; pilot rounds run against the guideline.
  • Calibration results revise the guideline before production — disagreement here is cheap; the same disagreement at scale is a defect.
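As an illustration only, the calibration rule above can be sketched in a few lines. The function names and the 0.90 pass threshold are assumptions made for this example; the Factory's actual calibration criteria are not stated on this page.

```python
# Hypothetical sketch: the names and the 0.90 threshold are illustrative
# assumptions, not the Factory's published calibration criteria.
GOLD_PASS_THRESHOLD = 0.90


def gold_accuracy(pilot_labels: dict[str, str], gold_labels: dict[str, str]) -> float:
    """Share of gold items the contributor labeled the same way as the gold answer."""
    if not gold_labels:
        raise ValueError("calibration needs at least one gold item")
    correct = sum(
        1 for item_id, gold in gold_labels.items()
        if pilot_labels.get(item_id) == gold
    )
    return correct / len(gold_labels)


def cleared_for_production(pilot_labels, gold_labels, threshold=GOLD_PASS_THRESHOLD) -> bool:
    """A contributor is assigned live data only once calibration passes."""
    return gold_accuracy(pilot_labels, gold_labels) >= threshold


gold = {"g1": "PERSON", "g2": "PLACE", "g3": "PERSON", "g4": "ORG"}
pilot = {"g1": "PERSON", "g2": "PLACE", "g3": "ORG", "g4": "ORG"}
print(gold_accuracy(pilot, gold))           # 3 of the 4 gold items match
print(cleared_for_production(pilot, gold))  # falls below the assumed threshold
```

In this sketch a failed pilot round blocks the contributor from live data; in the method described above, it would also be a prompt to revisit the guideline before production starts.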
IV

Produce

Principle

Throughput under live governance, not instead of it.

  • Annotation, transcription, recording, and post-editing at planned throughput, with QA sampling running continuously.
  • Disagreement is routed to adjudication as signal — never averaged away, never left for the end.
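A minimal sketch of the routing rule above, under the assumption that any non-unanimous item counts as a disagreement. The `route` function and its record fields are hypothetical names for this example. The point it illustrates is that a disputed item keeps every vote and goes to adjudication, instead of being settled by a majority count.

```python
from collections import Counter


def route(item_id: str, labels: list[str]) -> dict:
    """Accept unanimous items; send any disagreement to adjudication with all votes kept."""
    if not labels:
        raise ValueError("an item needs at least one label to be routed")
    votes = Counter(labels)
    if len(votes) == 1:
        return {"item": item_id, "status": "accepted", "label": labels[0]}
    # Disagreement is preserved as signal: no majority vote, no averaging.
    return {"item": item_id, "status": "adjudicate", "votes": dict(votes)}


# Hypothetical batch: item IDs and label names are made up for the example.
batch = {
    "utt-001": ["greeting", "greeting", "greeting"],
    "utt-002": ["greeting", "request", "greeting"],
}
routed = [route(item_id, labels) for item_id, labels in batch.items()]
adjudication_queue = [r for r in routed if r["status"] == "adjudicate"]
print(adjudication_queue)
```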
V

Validate

Principle

Judged by the qualified, recorded as it is judged.

  • Senior adjudication resolves disputes with documented rationale; references are validated; the rationale flows back into the guideline.
  • Terminology decisions surfaced by the work are fed to the glossary function — the lexicon learns from every engagement.
  • The Five-Gate Validation Protocol™ closes over the deliverable.
VI

Release

Principle

A dataset ships with its evidence, or it does not ship.

  • Every release carries its datasheet (composition, authorship, versions), its quality record (agreement, adjudication, gold accuracy), and its rights file.
  • Versioned, documented, and mapped to the Trust Center — whether delivered to a client or published into the firm's own measures.
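The release rule above can be pictured as a simple completeness check. This is a sketch under stated assumptions: the `Release` class, its field names, and the example values are hypothetical, and only the three evidence artifacts (datasheet, quality record, rights file) come from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Release:
    """Hypothetical release record; only the three evidence artifacts come from the text."""
    name: str
    version: str
    datasheet: Optional[dict] = None       # composition, authorship, versions
    quality_record: Optional[dict] = None  # agreement, adjudication, gold accuracy
    rights_file: Optional[dict] = None


REQUIRED_EVIDENCE = ("datasheet", "quality_record", "rights_file")


def missing_evidence(release: Release) -> list[str]:
    """List the evidence artifacts that are absent or empty."""
    return [name for name in REQUIRED_EVIDENCE if not getattr(release, name)]


def can_ship(release: Release) -> bool:
    """A dataset ships with its evidence, or it does not ship."""
    return not missing_evidence(release)


draft = Release(
    name="example-corpus",
    version="1.0.0",
    datasheet={"composition": "illustrative", "authorship": "illustrative"},
    quality_record={"gold_accuracy": 0.97},  # made-up value for the example
)
print(missing_evidence(draft))  # the rights file has not been attached
print(can_ship(draft))
```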
What comes back

The pipeline ends where it begins.

A line that never feeds itself produces the same quality forever. The Pipeline closes the loop deliberately: each release returns its learnings to the system, so the next specification starts smarter than the last one did.

Corrections return as signal.

Post-editing and review findings become error-annotated data and guideline revisions — the edits are the asset.

Terminology returns to the lexicon.

Every term decided in Validate enters the governed glossaries, with provenance — nothing decided is decided twice.

Benchmarks version forward.

Evaluation instruments persist and improve, so progress stays measurable against the same ruler.

The pipeline is a circle drawn as a line.
The spine

The five gates are companions to the stations, not a final stop.

At Specify

The standards baseline (Gate 3) is encoded into the spec itself.

At Source

The population-risk gate (Gate 4) governs consent, rights, and collection before a single item exists.

Through Calibrate and Produce

Linguistic accuracy and cultural validity (Gates 1–2) run continuously on the work as it is made.

At Validate and Release

All five gates close over the deliverable, sealed under the CCB Sign-Off Mark.
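The placement described above can be written down as a lookup table. This is an illustrative sketch: the variable and function names are hypothetical, and gates are referred to by number only, because this page does not name Gate 5 or say which of Gates 1 and 2 is which.

```python
# Station order as listed in the pipeline definition.
STATIONS = ["Specify", "Source", "Calibrate", "Produce", "Validate", "Release"]

# Which gates are active at each station, per the placement described above.
GATES_AT_STATION = {
    "Specify": {3},                   # standards baseline encoded into the spec
    "Source": {4},                    # population-risk gate
    "Calibrate": {1, 2},              # linguistic accuracy and cultural validity
    "Produce": {1, 2},
    "Validate": {1, 2, 3, 4, 5},      # all five gates close over the deliverable
    "Release": {1, 2, 3, 4, 5},
}


def stations_for_gate(gate: int) -> list[str]:
    """Stations, in pipeline order, at which the given gate applies."""
    return [s for s in STATIONS if gate in GATES_AT_STATION[s]]


print(stations_for_gate(3))
print(stations_for_gate(1))
```

Read this way, no gate appears only at the end of the line except Gate 5, whose placement before Validate is not described here.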

The canonical protocol, in full — the Five-Gate Validation Protocol™ →
In practice

One method, whether the order is small or the corpus is vast.

The pipeline is what makes the Factory predictable.

A thousand-item labeling order and a hundred-hour speech corpus run the same six stations — the stations scale; they are never skipped. Your diligence team can walk the method end to end before committing, because the method is written down. Every delivered item is traceable to who produced it, under which guideline version, adjudicated by whom. And when the timeline is brutal — timelines are sometimes brutal — the answer is acceleration inside the method: rush jobs run the pipeline faster, never around it. There are no side doors, because a side door, used once, is in the data forever.
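One way to picture the per-item traceability described above is a small, immutable record attached to every delivered item. The class, field names, and identifiers below are assumptions made for illustration; the page specifies only what must be traceable (producer, guideline version, adjudication), not how it is stored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ItemTrace:
    """Hypothetical per-item provenance record."""
    item_id: str
    producer_id: str
    guideline_version: str
    adjudicator_id: Optional[str] = None  # None when the item was never disputed


def untraceable(delivered_ids: list[str], traces: list[ItemTrace]) -> list[str]:
    """Delivered item IDs that have no provenance record at all."""
    known = {t.item_id for t in traces}
    return [item_id for item_id in delivered_ids if item_id not in known]


# Made-up identifiers for the example.
traces = [
    ItemTrace("item-1", producer_id="ann-07", guideline_version="v2.3"),
    ItemTrace("item-2", producer_id="ann-11", guideline_version="v2.3",
              adjudicator_id="senior-02"),
]
print(untraceable(["item-1", "item-2", "item-3"], traces))
```

A delivery check of this kind would treat any non-empty result as a blocker, in keeping with the rule that nothing ships off-pipeline.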

One method, every dataset.

Six stations whether the engagement is a pilot or a program — predictability you can plan against.

Provenance end to end.

Every item traceable: producer, guideline version, adjudication record.

No side doors.

Nothing the Factory ships was produced off-pipeline. Nothing.

A method your diligence can walk.

The pipeline is documented; the audit is an exercise, not archaeology.

Provenance, recorded

Every release maps to the Trust Center.

Datasheet, quality record, and rights file — versioned and auditable, whether the dataset is delivered to a client or published into the firm's own measures.

Visit the Trust Center →
24 · Afghan languages and dialect bands
0 · Security incidents
100% · Senior-led engagements
41+ · Trust Center documents
The next dataset

Make your next dataset remember a method.

For AI teams and institutions whose data will outlive every deadline that shaped it. Briefings are conducted under NDA, in Washington, D.C. or virtually.