Operating Model · Data & Infrastructure

Dataset Catalog & Licensing

The corpus was ten thousand hours and a price. It came with no proof of where it came from, and no clean right to use it.

Dataset Catalog & Licensing is where the AI Data Factory's output becomes available to license: curated Afghan linguistic and cultural datasets — speech, text, annotation, lexical, and evaluation — each carrying a datasheet, a documented consent basis, and license terms that state plainly what you may do with it. Provenanced data, with the rights to match. Not bytes by the hour.

Provenanced.
Consented.
Licensed.
Why curated
Definition

Dataset Catalog & Licensing is where the AI Data Factory's output becomes available to license: curated Afghan linguistic and cultural datasets across speech, text, annotation, lexical, and evaluation categories, spanning 24 Afghan languages and their dialect bands. Every dataset is produced or vetted through the firm's pipeline rather than scraped, and every catalog entry pairs a datasheet — documenting provenance, composition, and consent — with a license stating permitted and prohibited uses. Because a corpus is not a license: having the data is not having the right to use it, and what this catalog provides is provenanced data with the rights to match.

The data market sells you a number and stays silent on the two things that matter.

A dataset is usually marketed the way bulk commodities are: by quantity. Ten thousand hours of speech, a billion tokens of text, a million labeled examples — a figure, a format, and a price. What the listing almost never includes is the two facts a serious buyer actually needs. The first is provenance: where did this come from, how was it made, and whose data is inside it. The second is the right to use it: not the file, but the license — what you may lawfully and ethically do with the data, including whether you may train on it, and whether you may use a voice in it to synthesize speech.

For the Afghan languages, the silence is worse, because the cheap way to reach those impressive numbers is to scrape — sermons, phone-ins, social audio, and web text harvested from communities that never agreed to any of it. The buyer trains on the corpus, ships the model, and inherits everything the seller declined to mention: contamination, non-consent, unprovenanced material of unknown copyright, and no way to prove where a single hour came from when a regulator, a court, or a community eventually asks. The number was real. Everything that made the number safe to use was missing.

So this catalog is built the other way around. Every dataset is curated — produced or vetted through the firm's pipeline, not scraped — and every entry leads with the two facts the market hides: a datasheet that documents provenance and composition, and a license that states the rights. The hours are the commodity. The provenance and the license are the product.

24
languages and dialect bands represented across the catalog
100%
of datasets carrying a datasheet, documented provenance, and a consent basis
0
scraped community data in inventory — ever
5
validation gates standing behind every dataset released
The doctrine

A corpus is not a license.

Having the data is not having the right to use it. What this catalog licenses is provenanced data with the rights to match — documented, consented, and defensible when someone asks.

The catalog

Curated datasets, by category — each the documented output of a function, not a scrape.

01
Speech & ASR corpora

Transcribed Afghan-language speech designed for representation across dialect bands, registers, and acoustic conditions — with reference-grade transcription.

Produced through: ASR / TTS Reference Sets →
02
Voice & TTS corpora

Studio-grade voice-talent recordings for synthesis, with consent whose scope explicitly and separately covers synthesis.

Produced through: ASR / TTS Reference Sets →
03
Text & parallel corpora

Monolingual and aligned bilingual text for translation and language modeling, post-edited and quality-recorded.

Produced through: Annotation & Post-Editing →
04
Annotated & labeled datasets

Classification, span and entity, preference, and safety-labeled data — annotated in-language by qualified members of the Collective.

Produced through: Annotation & Post-Editing →
05
Lexical & terminology resources

Governed glossaries and pronunciation lexicons with dialect-band variants and provenance.

Produced through: Bilingual Glossary Maintenance →
06
Evaluation & benchmark sets

Native-authored, contamination-conscious, band-stratified test sets with validated references.

Produced through: Low-Resource Model Evaluations →

Specific datasets, their composition, and their measures are listed in the catalog itself; what every category shares is a datasheet, a consent basis, and a license.

The anatomy

Every entry answers the two questions a listing usually hides.

A catalog entry is not a product card with a size and a price. It is a datasheet bound to a license — provenance on one side, rights on the other — so a buyer knows both what they are getting and what they may do with it before they commit.

Datasheet · Provenance
Dataset & version
[name · version · release date]
Modality & composition
[speech / text / annotation / lexical / evaluation · size · structure]
Languages & dialect bands
[languages · bands covered]
Provenance — how it was made
produced through the ADF Pipeline · [stations · guideline versions]
Consent basis
[consent obtained; synthesis scope where applicable — explicit and separate]
Quality record
[validation, agreement, datasheet metrics]
License · Rights
License & permitted uses
[license type · research / commercial / training scope]
Prohibited uses
[e.g., unconsented synthesis, re-identification — stated, not implied]
Rights & provenance file
[reference to the retained documentation]
Permitted and prohibited uses are governed by the licensing perimeter →

Provenance on one side, rights on the other — and nothing licensed without both.

The terms, and what they will not do

A license you can defend rests on data you can account for.

Licensing data about people is not a pure commercial act; it carries an obligation to the people in the data. The terms are written to hold both the buyer's right and that obligation at once.

A provenance-backed license
Every license rests on documented provenance — you can show where the data came from and how it was made, which is exactly what makes the right to use it defensible later.
Synthesis consent, explicit and separate
No voice dataset is licensed for synthesis unless the speakers consented to synthesis specifically — a separate grant, never bundled into a general release, as the firm's speaker perimeter requires.
No scraped data, ever
The catalog contains no harvested community material — no scraped sermons, calls, broadcasts, social audio, or web text. The industry's cheap path to volume is not in inventory.
Revenue never overrides consent
A licensing fee does not unlock a use the consent did not grant. What may be offered at all is governed by Gate 4 — Population Risk — before it is governed by a price.
Licensing scopes, defined per dataset
Permitted uses — research, commercial, training, evaluation — are scoped per dataset and stated in its terms, so a buyer licenses exactly the right they need and no ambiguity they will regret.
In practice

Data your legal team can clear and your model can stand on.

For the team sourcing Afghan-language data, this catalog changes what acquiring data does to your risk. You receive datasets that arrive with their datasheet, so your own documentation inherits a provenance record instead of a blank where one should be. You license exactly the right you need — research, commercial, training, or synthesis — stated plainly, so your legal review clears a defined scope rather than guessing at an ambiguous one. The consent and rights file travels with the data, ready for the diligence question and the regulator's eventual interest. And because nothing in inventory was scraped, the model you train inherits a clean lineage rather than a liability you will discover later. This is the rare data acquisition that lowers your risk instead of importing it.

A datasheet with every dataset
Provenance and composition documented — your records inherit a lineage, not a gap.
The exact right, stated
Permitted uses scoped and written — research, commercial, training, or synthesis — clearable by your legal team.
A rights file that travels
Consent basis and provenance retained and producible when diligence asks.
A clean lineage to train on
No scraped material in inventory — the liability you would otherwise discover later is simply absent.
Security & assurance

Licensed data, held to the firm's security posture.

Every dataset in the catalog lives inside the Atlas Data Platform — access-controlled, auditable, and held to the same security and assurance discipline documented in the Trust Center. Provenance and rights records are retained and producible; the environment that serves the data is built to be examined, not taken on faith.

Explore the Trust Center →
24
Afghan languages and dialect bands
0
security incidents
100%
senior-led engagements
41+
Trust Center documents

License data you can defend.

For teams that need Afghan-language data with a provenance record and a clean right to use it — not a number and a silence. Briefings are conducted under NDA, in Washington, D.C. or virtually.

Request a confidential briefing