2026-06-03·5 min read·sota.io Team

EU AI Act Art.10 Data Governance: Training Data Documentation Requirements for High-Risk AI (August 2026)

Post #1 in the EU AI Act Data Governance Sprint 2026 — 5-part series on Art.10 compliance for high-risk AI providers

Sixty days before the August 2026 EU AI Act compliance deadline, most high-risk AI providers have addressed the visible obligations: risk management systems (Art.9), logging infrastructure (Art.12), transparency notices (Art.13). What many have not addressed is Art.10 — data and data governance.

Art.10 is operationally demanding in a different way from the other Articles. It does not ask you to implement a system. It asks you to document the choices you made about your training data, in a form that survives a National Competent Authority (NCA) audit. If you trained your model two years ago and kept no data governance records, you have an Art.10 problem.

This post covers exactly what documentation Art.10 requires, why the documentation bar is higher than most developers expect, and how to reconstruct or create the required artefacts before August 2026.

What Art.10 Requires: The Five Core Obligations

Art.10 of the EU AI Act establishes data and data governance requirements for high-risk AI systems. Unlike the AI Act's conformity assessment provisions (Art.43), Art.10 obligations apply continuously — not just at the point of market placement.

The five core obligations are:

1. Data Governance and Management Practices

Providers must subject training, validation, and testing datasets to appropriate data governance and management practices. This is a principles-based obligation — the Act does not specify a particular framework — but the NCA audit will ask you to demonstrate what your governance practices were. A retrospective claim that "we followed best practices" without records is insufficient.

What this means in practice:

A documented data management policy that was in place during dataset curation
Records showing who had access to training data and under what conditions
Version control for datasets (or equivalent provenance records)
Documented decision points: why a particular dataset was selected, why another was excluded
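
The version-control point does not necessarily need a dedicated tool. As a minimal sketch, assuming the dataset is stored as files under one directory, the function below derives a deterministic fingerprint that could serve as a dataset version identifier. The name `dataset_fingerprint` and the directory layout are illustrative assumptions; the Act does not prescribe any particular mechanism.

```python
import hashlib
from pathlib import Path


def dataset_fingerprint(root: str) -> str:
    """Return a SHA-256 hex digest over every file under `root`.

    Hypothetical helper: files are visited in sorted order so the digest
    is deterministic. Both the relative path and the file bytes are
    hashed, so a rename or a content change yields a new fingerprint.
    """
    digest = hashlib.sha256()
    base = Path(root)
    for path in sorted(p for p in base.rglob("*") if p.is_file()):
        digest.update(path.relative_to(base).as_posix().encode("utf-8"))
        digest.update(b"\0")
        with path.open("rb") as handle:
            # Read in 1 MiB chunks so large files do not fill memory.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        digest.update(b"\0")
    return digest.hexdigest()
```

Recording this digest alongside the curation date gives later audits a stable reference to the exact data that was used.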

2. Training Data Quality Criteria

The Act requires training, validation, and testing datasets to be:

Relevant to the task the AI system performs
Sufficiently representative of the population or context the system will operate in
To the best extent possible, free of errors
Complete — not missing categories of data that are material to the system's function

Each criterion requires documentation. "We believe the data was representative" is not evidence. An NCA will look for:

A written description of how representativeness was assessed
Statistics on class balance, demographic coverage, or domain coverage (depending on the use case)
Any known data gaps and the rationale for proceeding despite them
Error detection procedures applied to the dataset and their results
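
The statistics mentioned above can start very simply. The sketch below assumes each sample is a Python dict with scalar values and a label field; it reports class balance, rows with missing values, and exact duplicates. The function name `quality_summary` and the row format are assumptions for illustration, and the numbers it produces are inputs to a written assessment, not a substitute for one.

```python
from collections import Counter


def quality_summary(rows, label_key):
    """Summarise class balance, missing values, and exact duplicates.

    `rows` is a list of dicts, one per sample (a hypothetical format).
    """
    total = len(rows)
    labels = Counter(row.get(label_key) for row in rows)
    # A row counts as incomplete if any field is None or an empty string.
    missing = sum(
        1 for row in rows
        if any(value is None or value == "" for value in row.values())
    )
    # Exact duplicates: identical key/value pairs, regardless of key order.
    seen = Counter(tuple(sorted(row.items())) for row in rows)
    duplicates = sum(count - 1 for count in seen.values())
    return {
        "total_samples": total,
        "class_share": {k: v / total for k, v in labels.items()} if total else {},
        "rows_with_missing_values": missing,
        "duplicate_rows": duplicates,
    }
```

Storing the returned dictionary with a date and the dataset version turns an informal check into a record that can be produced on request.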

3. Statistical Properties Appropriate to Intended Use

The dataset must have statistical properties appropriate for the specific geographical, contextual, behavioural, or functional setting in which the system is intended to be used. This is a localisation requirement that catches a common failure: training on global or US-market data and deploying in the EU without adjustment.

Required documentation:

The intended deployment context (geography, user population, operational environment)
Statistical analysis showing the training data matches this context
Any sub-population analysis (if the system is used differently across user segments, each segment's data coverage must be documented)
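
One way to make "the training data matches this context" measurable is to compare the share of each category in the training data with a reference distribution for the deployment context. The sketch below uses total variation distance for that comparison. It is one possible metric chosen here for illustration; the Act names no metric or threshold, and the reference shares are an assumption that you would need to source and cite yourself (for example from official statistics).

```python
from collections import Counter


def total_variation_distance(train_values, reference_share):
    """Half the L1 distance between training shares and reference shares.

    `train_values` is an iterable of category labels (e.g. country codes).
    `reference_share` maps each category to its expected share (sums to 1).
    Returns 0.0 for identical distributions and 1.0 for disjoint ones.
    """
    counts = Counter(train_values)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("train_values is empty")
    categories = set(counts) | set(reference_share)
    return 0.5 * sum(
        abs(counts.get(c, 0) / total - reference_share.get(c, 0.0))
        for c in categories
    )
```

Running the same comparison per user segment gives a simple starting point for the sub-population analysis.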

4. Bias Examination

Art.10 requires providers to examine training, validation, and testing datasets for possible biases that are likely to affect the health and safety of persons, have a negative impact on fundamental rights, or lead to discrimination prohibited under Union law (Art.10(2)(f)). This includes biases related to protected characteristics under EU fundamental rights law.

The examination must be:

Documented: what was tested, when, by whom
Methodologically described: what bias metrics were used and why they were chosen
Results-recorded: what biases were found (including negative results — "no bias detected" must be backed by the test records, not just the assertion)
Linked to mitigations: if biases were found, what was done? Resampling, reweighting, data augmentation, model adjustments?

Unexamined data is not compliant data. The obligation is to examine, not merely to avoid bias.
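
As an illustration of a documented, repeatable bias metric, the sketch below computes the demographic parity difference: the largest gap in positive-outcome rate between any two groups. It assumes records are dicts with a group field and a binary outcome field; both function names are hypothetical. Demographic parity is only one of several metrics, and which one is appropriate depends on the use case.

```python
from collections import defaultdict


def positive_rate_by_group(records, group_key, outcome_key):
    """Share of positive outcomes per group (outcomes are truthy/falsy)."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for record in records:
        group = record[group_key]
        totals[group] += 1
        if record[outcome_key]:
            positives[group] += 1
    return {group: positives[group] / totals[group] for group in totals}


def demographic_parity_difference(records, group_key, outcome_key):
    """Largest gap in positive-outcome rate between any two groups."""
    rates = positive_rate_by_group(records, group_key, outcome_key)
    if not rates:
        raise ValueError("records is empty")
    return max(rates.values()) - min(rates.values())
```

Keeping the per-group rates, the date, and the examiner's name together with the result covers the "what was tested, when, by whom" points above, including when the gap turns out to be small.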

5. Special Categories Documentation

Where the examination for biases requires processing of special categories of personal data (for example, processing racial or ethnic origin data to detect demographic bias), Art.10(5) of the Act provides a specific legal basis with additional safeguards. This processing must be documented with appropriate data protection controls and limited strictly to bias detection and correction.

If your bias examination required any sensitive data categories, your documentation must cover:

The legal basis used
The data minimisation measures applied
The retention and deletion timeline for sensitive data used in bias testing
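
The three items above can be held in a small structured record so that the deletion date is checked rather than remembered. This is a sketch only: the class name, field names, and the idea of a single `delete_by` date are assumptions made for illustration, and the legal basis itself must come from your own legal review.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SensitiveDataRecord:
    """Hypothetical record of special-category data used for bias testing."""

    categories: tuple           # e.g. ("ethnic origin",)
    legal_basis: str            # as determined by legal review
    minimisation_measures: str  # e.g. pseudonymisation, restricted access
    delete_by: date             # agreed deletion deadline

    def deletion_overdue(self, today: date) -> bool:
        """True once the agreed deletion deadline has passed."""
        return today > self.delete_by
```

A scheduled job that calls `deletion_overdue` with the current date is one simple way to surface records that have outlived their documented retention period.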

The Documentation Gap: Why Most Teams Are Not Ready

The most common Art.10 failure pattern is not that providers trained models on bad data. It is that providers cannot prove they trained models on good data.

Consider the typical ML team workflow circa 2023–2024:

Data scientist sources training data from a mix of internal systems and external datasets
Preprocessing and cleaning happens in notebooks with limited version control
Model training occurs, parameters are logged in MLflow or Weights & Biases
Model is evaluated on a validation set
Model is deployed

What this produces: excellent model metrics, a reproducible training pipeline, and virtually no Art.10-compliant documentation. MLflow logs capture training runs, not data governance choices. The rationale for dataset selection exists only in the data scientist's memory or in informal Slack conversations.

The documentation that Art.10 requires must exist before the system is placed on the market or put into service. Reconstruction is legally permissible (the Act does not require contemporaneous records — it requires the records to exist before market placement) but it must accurately reflect actual decisions, not post-hoc rationalisations.

Required Documentation Artefacts

Dataset Card (Mandatory)

A dataset card is the primary Art.10 documentation artefact. It should cover:

## Dataset: [Name and Version]

### Purpose and Task Context
- Intended AI task: [classification/regression/generation/etc.]
- Deployment context: [geography, sector, user population]
- Why this dataset was selected over alternatives

### Coverage and Representativeness
- Total samples: [N]
- Time range of data: [from → to]
- Geographic coverage: [countries/regions]
- Demographic coverage: [if applicable — age, gender, ethnicity distribution]
- Known gaps: [what the dataset does not cover and why]

### Quality Assessment
- Error detection methods applied: [describe]
- Error rate found: [X%]
- Errors removed: [Y samples removed for reason Z]
- Residual error estimate: [%]

### Data Lineage
- Primary sources: [source 1, source 2]
- Preprocessing steps: [step 1, step 2]
- Dataset version: [hash or version identifier]
- Curation date: [YYYY-MM-DD]
- Responsible team/individual: [name/role]

### Bias Examination
- Bias metrics used: [demographic parity, equalised odds, etc.]
- Protected attributes examined: [age, gender, nationality, etc.]
- Examination date: [YYYY-MM-DD]
- Results: [findings, including negative results]
- Mitigations applied: [if any]

### Special Categories Processing (if applicable)
- Sensitive data categories processed for bias examination: [specify]
- Legal basis: [Art.9(2) GDPR — scientific research / substantial public interest]
- Data minimisation measures: [describe]
- Retention: [delete after bias examination, on [date]]

Data Governance Policy (Mandatory)

A written policy describing:

Who is responsible for data quality decisions
What data sources are permissible and impermissible
How third-party dataset licences are evaluated
How data version control is maintained
How dataset updates are propagated to the model

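This policy need not be complex (two pages is sufficient), but it must exist and must accurately describe the practices that were actually followed during the data curation period. A policy written now should state openly that it codifies earlier practice.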

Bias Examination Report (Mandatory)

Separate from the dataset card, a standalone report covering:

Scope of the examination (which datasets, which protected attributes)
Methodology (specific metrics, statistical tests, tools used)
Results with supporting statistics
Decision rationale: if biases were found and mitigations were applied, why was the chosen mitigation selected? If biases were found and no mitigation was applied, what was the justification?

The NCA audit may focus specifically on this report because bias in high-risk AI systems is a primary regulatory concern.

Data-Model Linkage Record

A record that links specific dataset versions to specific model versions. This prevents the common situation where multiple model versions were trained on variants of a dataset and it becomes impossible to reconstruct which data produced which deployed model.

At minimum: a table of [model_version, dataset_version, training_date, dataset_card_reference].
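
The table can live in a plain CSV file under version control. The sketch below, using only the Python standard library, writes linkage rows with the four columns named above and looks up the dataset versions behind a given model version. The function names and the example values in the usage note are hypothetical.

```python
import csv

FIELDS = (
    "model_version",
    "dataset_version",
    "training_date",
    "dataset_card_reference",
)


def write_linkage(rows, stream):
    """Write linkage rows (dicts keyed by FIELDS) as CSV with a header."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


def datasets_for_model(stream, model_version):
    """Return every dataset version recorded for `model_version`."""
    return [
        row["dataset_version"]
        for row in csv.DictReader(stream)
        if row["model_version"] == model_version
    ]
```

For example, after writing rows for model `1.2.0` to a file, calling `datasets_for_model(open_file, "1.2.0")` returns the dataset versions recorded for that model, which is the question an auditor is likely to ask about the deployed version.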

How Art.10 Connects to Other Articles

Art.10 does not stand alone. Its documentation feeds directly into other compliance requirements:

Annex IV (Technical Documentation): The technical documentation that providers must maintain under Art.11 and Annex IV explicitly requires information about training data, including validation and testing methodologies. Your Art.10 artefacts are inputs to your Annex IV technical file.

Art.9 (Risk Management System): The RMS must account for risks that could arise from training data — including risks from biased data. Your bias examination report under Art.10 should be referenced in your RMS. An RMS that does not address data risks is incomplete.

Art.12 (Record-Keeping): Art.12 logging requirements cover events that may reveal data quality issues after deployment. Art.10 documentation defines the baseline; Art.12 logs capture deviations from that baseline observed in production.

Art.15 (Accuracy, Robustness, Cybersecurity): Performance metrics declared under Art.15 must be substantiated by the validation dataset referenced in your Art.10 documentation. If your accuracy claims rest on a validation set that was not properly documented, your Art.15 declaration is unsupported.

August 2026 Readiness: What to Do in the Next 60 Days

If your training data documentation is incomplete, the practical remediation path is:

Week 1–2: Audit existing documentation

Inventory all datasets used in training, validation, and testing
Identify what documentation already exists (MLflow logs, data source records, preprocessing notebooks)
Map the gaps against the dataset card template above

Week 3–4: Reconstruct documentation

Produce dataset cards for all training datasets, using available evidence (data source records, preprocessing scripts, evaluation logs)
Reconstruct the bias examination record from whatever bias testing was done, even informally
If no bias examination was done: conduct one now and document it properly

Week 5–6: Formalise governance

Write the data governance policy (the policy can be written now and can acknowledge that it codifies previous practice)
Establish the data-model linkage record
Ensure future model updates will produce contemporaneous documentation

Week 7–8: Review and integrate

Cross-reference the Art.10 documentation with the Annex IV technical file
Ensure the RMS references the data risks documented in the bias examination report
Obtain legal review of any special-categories processing carried out for bias examination (if applicable)

What NCAs Will Look For in an Audit

Based on the regulatory framework and the guidance published by the European AI Office, NCA data audits for high-risk AI are expected to focus on:

Documentation existence and completeness — Can the provider produce the records? Are they comprehensive or superficial?
Documentation contemporaneity — Were records created when decisions were made, or are they clearly post-hoc? (Reconstruction is not prohibited but obvious post-hoc rationalisation is a red flag.)
Bias examination depth — Was the examination systematic and methodology-driven, or was it cursory?
Linkage between data and deployment — Can the provider trace which specific dataset version is behind the currently deployed model?
Special categories handling — If sensitive data was processed for bias examination, was it handled in compliance with GDPR Art.9 requirements?

Compliance Checklist: Art.10 Data Governance

Use this before August 2026:

Dataset card exists for every training, validation, and testing dataset
Dataset card covers: purpose, coverage, quality assessment, data lineage, bias examination, and (if applicable) special categories processing
Data governance policy is written and applies to all future data curation
Data-model linkage record links current deployed model to specific dataset versions
Bias examination report exists with documented methodology, results, and mitigations
Bias examination report is referenced in the Art.9 Risk Management System
Art.10 documentation is part of the Annex IV technical file
Retention and deletion timeline for any sensitive data used in bias testing is documented
Data governance documentation is stored in a durable, audit-accessible location (not just developer laptops or informal wikis)
Documentation covers all datasets used across the model lifecycle, not just the final training run

What's Next in This Series

This is post #1/5 in the EU AI Act Data Governance Sprint 2026. The remaining posts in this series:

Post #2: Dataset Diversity and Bias Testing — how to audit training data for EU AI Act compliance, what bias metrics to use, and how to document results
Post #3: Data Provenance Logging — tracking training data origin, transformations, and governance records across the model lifecycle
Post #4: Data Governance CI/CD Gates — automated training data compliance checks for high-risk AI pipelines
Post #5: Art.10 Data Governance Finale — complete training data compliance checklist before August 2026

Art.10 compliance is less about technology than about documentation discipline. The goal is not to have perfect data — it is to have documented, examined, and governed data. The August 2026 deadline is sixty days away.

sota.io is an EU-native managed PaaS for deploying high-risk AI systems with the infrastructure controls that Art.12 logging, Art.15 accuracy monitoring, and Art.10 data governance require. Built in Germany, no US parent, no CLOUD Act exposure. Start with sota.io →
