2026-06-03·5 min read·sota.io Team

EU AI Act Art.10 Data Governance: Training Data Documentation Requirements for High-Risk AI (August 2026)

Post #1 in the EU AI Act Data Governance Sprint 2026 — 5-part series on Art.10 compliance for high-risk AI providers


Sixty days before the August 2026 EU AI Act compliance deadline, most high-risk AI providers have addressed the visible obligations: risk management systems (Art.9), logging infrastructure (Art.12), transparency notices (Art.13). What many have not addressed is Art.10 — data and data governance.

Art.10 is operationally demanding in a way that differs from other Articles. It does not ask you to implement a system. It asks you to document choices you already made about your training data — and to have done so in a way that survives a National Competent Authority (NCA) audit. If you trained your model two years ago and have no contemporaneous data governance records, you have an Art.10 problem.

This post covers exactly what documentation Art.10 requires, why the documentation bar is higher than most developers expect, and how to reconstruct or create the required artefacts before August 2026.


What Art.10 Requires: The Five Core Obligations

Art.10 of the EU AI Act establishes data and data governance requirements for high-risk AI systems. Unlike the AI Act's conformity assessment provisions (Art.43), Art.10 obligations apply continuously — not just at the point of market placement.

The five core obligations are:

1. Data Governance and Management Practices

Providers must subject training, validation, and testing datasets to appropriate data governance and management practices. This is a principles-based obligation — the Act does not specify a particular framework — but the NCA audit will ask you to demonstrate what your governance practices were. A retrospective claim that "we followed best practices" without records is insufficient.

What this means in practice:

2. Training Data Quality Criteria

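The Act requires training, validation, and testing datasets to be relevant, sufficiently representative and, to the best extent possible, free of errors and complete in view of the intended purpose (Art.10(3)).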

Each criterion requires documentation. "We believe the data was representative" is not evidence. An NCA will look for:

3. Statistical Properties Appropriate to Intended Use

The dataset must have statistical properties appropriate for the specific geographical, contextual, behavioural, or functional setting in which the system is intended to be used. This is a localisation requirement that catches a common failure: training on global or US-market data and deploying in the EU without adjustment.
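One way to turn this localisation requirement into evidence is to compare the geographic mix of the training data with the mix expected in deployment. The following is a minimal sketch, not a prescribed method: the function name `coverage_gaps`, the 10-point tolerance, and the example shares are all illustrative assumptions, and the Act does not define a numeric threshold.

```python
from collections import Counter


def coverage_gaps(train_regions, target_shares, tolerance=0.10):
    """Return regions whose share of the training data differs from the
    expected deployment share by more than `tolerance` (absolute difference).

    `tolerance` is an illustrative assumption, not a value taken from the Act.
    """
    counts = Counter(train_regions)
    total = sum(counts.values())
    gaps = {}
    for region, expected in target_shares.items():
        observed = counts.get(region, 0) / total if total else 0.0
        if abs(observed - expected) > tolerance:
            gaps[region] = {"observed": round(observed, 3), "expected": expected}
    return gaps


# Hypothetical example: US-heavy training data, intended deployment in DE and FR.
train_regions = ["US"] * 70 + ["DE"] * 20 + ["FR"] * 10
target_shares = {"DE": 0.5, "FR": 0.5}
gaps = coverage_gaps(train_regions, target_shares)
print(gaps)
```

In this hypothetical, both DE and FR are flagged as under-represented. The output of a check like this, together with the reasoning for any adjustment made, is the kind of record that can be attached to the dataset documentation.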

Required documentation:

4. Bias Examination

Art.10 requires providers to examine training, validation, and testing datasets for possible biases that are likely to affect the health and safety of persons, have a negative impact on fundamental rights, or lead to discrimination prohibited under Union law (Art.10(2)(f)). This includes biases related to characteristics protected under EU fundamental rights law.

The examination must be:

Unexamined data is not compliant data. The obligation is to examine, not merely to avoid bias.
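As an illustration of what a recorded examination step might look like, the sketch below computes the positive-label rate per group in a dataset and the largest gap between groups, which is a dataset-level analogue of demographic parity. The function names, the `gender` and `label` fields, and the sample records are hypothetical. A real examination would cover more attributes and metrics, and the Act does not prescribe this particular measure.

```python
from collections import defaultdict


def positive_rate_by_group(records, group_key, label_key):
    """Share of records with a positive label (== 1) within each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for record in records:
        group = record[group_key]
        totals[group] += 1
        if record[label_key] == 1:
            positives[group] += 1
    return {group: positives[group] / totals[group] for group in totals}


def max_rate_gap(rates):
    """Largest difference in positive-label rate between any two groups."""
    return max(rates.values()) - min(rates.values())


# Hypothetical labelled records; a real dataset would be loaded from storage.
records = (
    [{"gender": "f", "label": 1}] * 3
    + [{"gender": "f", "label": 0}] * 7
    + [{"gender": "m", "label": 1}] * 6
    + [{"gender": "m", "label": 0}] * 4
)
rates = positive_rate_by_group(records, "gender", "label")
gap = max_rate_gap(rates)
print(rates, round(gap, 3))
```

Whatever the result, including a finding of no meaningful gap, the figures, the date, and the method belong in the examination record.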

5. Special Categories Documentation

Where bias detection and correction requires processing of special categories of personal data (for example, processing racial or ethnic origin data to detect demographic bias), Art.10(5) of the Act provides a specific legal basis with additional safeguards. This processing must be documented with appropriate data protection controls, and it must be limited to what is strictly necessary for bias detection and correction.

If your bias examination required any sensitive data categories, your documentation must cover:


The Documentation Gap: Why Most Teams Are Not Ready

The most common Art.10 failure pattern is not that providers trained models on bad data. It is that providers cannot prove they trained models on good data.

Consider the typical ML team workflow circa 2023–2024:

  1. Data scientist sources training data from a mix of internal systems and external datasets
  2. Preprocessing and cleaning happens in notebooks with limited version control
  3. Model training occurs, parameters are logged in MLflow or Weights & Biases
  4. Model is evaluated on a validation set
  5. Model is deployed

What this produces: excellent model metrics, a reproducible training pipeline, and virtually no Art.10-compliant documentation. MLflow logs capture training runs, not data governance choices. The rationale for dataset selection exists only in the data scientist's memory or in informal Slack conversations.

The documentation that Art.10 requires must exist before the system is placed on the market or put into service. Reconstruction is legally permissible (the Act does not require contemporaneous records — it requires the records to exist before market placement) but it must accurately reflect actual decisions, not post-hoc rationalisations.
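A lightweight way to capture data governance choices as they are made, and to mark honestly which entries were reconstructed later, is an append-only decision log. The sketch below is one possible shape under stated assumptions: the JSON Lines format, the field names, the `reconstructed` flag, and the example entries are all illustrative and not required by the Act.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def record_decision(log_path, dataset, decision, rationale, author, reconstructed=False):
    """Append one data governance decision to a JSON Lines log.

    `reconstructed` marks entries written after the fact, so the log never
    presents a later reconstruction as a contemporaneous record.
    """
    entry = {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "dataset": dataset,
        "decision": decision,
        "rationale": rationale,
        "author": author,
        "reconstructed": reconstructed,
    }
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return entry


def load_decisions(log_path):
    """Read all logged decisions back in the order they were written."""
    with open(log_path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Hypothetical usage with a temporary file; a real log would live in
# version-controlled or write-once storage.
with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "data_decisions.jsonl")
    record_decision(log_path, "claims-v3", "exclude records before 2019",
                    "schema change makes older labels unreliable", "data lead")
    record_decision(log_path, "claims-v2", "selected vendor dataset A over B",
                    "reconstructed from project notes", "data lead",
                    reconstructed=True)
    entries = load_decisions(log_path)

print(len(entries), [e["reconstructed"] for e in entries])
```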


Required Documentation Artefacts

Dataset Card (Mandatory)

A dataset card is the primary Art.10 documentation artefact. It should cover:

## Dataset: [Name and Version]

### Purpose and Task Context
- Intended AI task: [classification/regression/generation/etc.]
- Deployment context: [geography, sector, user population]
- Why this dataset was selected over alternatives

### Coverage and Representativeness
- Total samples: [N]
- Time range of data: [from → to]
- Geographic coverage: [countries/regions]
- Demographic coverage: [if applicable — age, gender, ethnicity distribution]
- Known gaps: [what the dataset does not cover and why]

### Quality Assessment
- Error detection methods applied: [describe]
- Error rate found: [X%]
- Errors removed: [Y samples removed for reason Z]
- Residual error estimate: [%]

### Data Lineage
- Primary sources: [source 1, source 2]
- Preprocessing steps: [step 1, step 2]
- Dataset version: [hash or version identifier]
- Curation date: [YYYY-MM-DD]
- Responsible team/individual: [name/role]

### Bias Examination
- Bias metrics used: [demographic parity, equalised odds, etc.]
- Protected attributes examined: [age, gender, nationality, etc.]
- Examination date: [YYYY-MM-DD]
- Results: [findings, including negative results]
- Mitigations applied: [if any]

### Special Categories Processing (if applicable)
- Sensitive data categories processed for bias examination: [specify]
- Legal basis: [Art.10(5) EU AI Act, relied on together with Art.9(2)(g) GDPR (substantial public interest)]
- Data minimisation measures: [describe]
- Retention: [delete after bias examination, on [date]]
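For the "Dataset version" field, a content hash is one option that ties the card to the exact bytes used in training. The following is a minimal sketch assuming the dataset is a single file. The function name `dataset_fingerprint` and the sample CSV content are illustrative, and a multi-file dataset would need a defined file ordering before hashing.

```python
import hashlib
import os
import tempfile


def dataset_fingerprint(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks so that
    large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Hypothetical usage: fingerprint a small CSV, then show that any change
# to the file produces a different identifier.
sample = b"id,label\n1,0\n2,1\n"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.csv")
    with open(path, "wb") as fh:
        fh.write(sample)
    fingerprint = dataset_fingerprint(path)
    with open(path, "ab") as fh:
        fh.write(b"3,1\n")
    changed = dataset_fingerprint(path)

print(fingerprint != changed)
```

Recording the digest in the card means a later reader can check whether a file they hold is the one the card describes.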

Data Governance Policy (Mandatory)

A written policy describing:

This policy need not be complex — two pages is sufficient — but it must exist and must have been in place during the data curation period.

Bias Examination Report (Mandatory)

Separate from the dataset card, a standalone report covering:

The NCA audit may focus specifically on this report because bias in high-risk AI systems is a primary regulatory concern.

Data-Model Linkage Record

A record that links specific dataset versions to specific model versions. This prevents the common situation where multiple model versions were trained on variants of a dataset and it becomes impossible to reconstruct which data produced which deployed model.

At minimum: a table of [model_version, dataset_version, training_date, dataset_card_reference].
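The four-column table above can be kept in a spreadsheet, but a small registry that refuses to overwrite an existing model entry makes the trace harder to corrupt. This is a sketch under assumptions: the class names `LinkageRecord` and `LinkageRegistry`, the in-memory storage, and the example versions and card path are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class LinkageRecord:
    """One row of the data-model linkage table."""
    model_version: str
    dataset_version: str
    training_date: date
    dataset_card_reference: str


class LinkageRegistry:
    """In-memory registry; a real one would persist to durable storage."""

    def __init__(self):
        self._by_model = {}

    def register(self, record):
        # A model version maps to exactly one dataset version; re-registering
        # would silently rewrite history, so it is rejected.
        if record.model_version in self._by_model:
            raise ValueError(f"model {record.model_version} is already linked")
        self._by_model[record.model_version] = record

    def dataset_for(self, model_version):
        """Return the linkage record for a deployed model version."""
        return self._by_model[model_version]


# Hypothetical usage.
registry = LinkageRegistry()
registry.register(LinkageRecord("model-1.4.0", "claims-v3", date(2025, 11, 3),
                                "cards/claims-v3.md"))
record = registry.dataset_for("model-1.4.0")
print(record.dataset_version, record.dataset_card_reference)
```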


Art.10 does not stand alone. Its documentation feeds directly into other compliance requirements:

Annex IV (Technical Documentation): The technical documentation that providers must maintain under Art.11 and Annex IV explicitly requires information about training data, including validation and testing methodologies. Your Art.10 artefacts are inputs to your Annex IV technical file.

Art.9 (Risk Management System): The RMS must account for risks that could arise from training data — including risks from biased data. Your bias examination report under Art.10 should be referenced in your RMS. An RMS that does not address data risks is incomplete.

Art.12 (Record-Keeping): Art.12 logging requirements include logging events that could relate to data quality issues discovered post-deployment. Art.10 documentation defines the baseline; Art.12 logs deviations from that baseline discovered in production.

Art.15 (Accuracy, Robustness, Cybersecurity): Performance metrics declared under Art.15 must be substantiated by the validation dataset referenced in your Art.10 documentation. If your accuracy claims rest on a validation set that was not properly documented, your Art.15 declaration is unsupported.


August 2026 Readiness: What to Do in the Next 60 Days

If your training data documentation is incomplete, the practical remediation path is:

Week 1–2: Audit existing documentation

Week 3–4: Reconstruct documentation

Week 5–6: Formalise governance

Week 7–8: Review and integrate


What NCAs Will Look For in an Audit

Based on the regulatory framework and the guidance published by the European AI Office, NCA data audits for high-risk AI are expected to focus on:

  1. Documentation existence and completeness — Can the provider produce the records? Are they comprehensive or superficial?

  2. Documentation contemporaneity — Were records created when decisions were made, or are they clearly post-hoc? (Reconstruction is not prohibited but obvious post-hoc rationalisation is a red flag.)

  3. Bias examination depth — Was the examination systematic and methodology-driven, or was it cursory?

  4. Linkage between data and deployment — Can the provider trace which specific dataset version is behind the currently deployed model?

  5. Special categories handling — If sensitive data was processed for bias examination, was it handled in compliance with GDPR Art.9 requirements?


Compliance Checklist: Art.10 Data Governance

Use this before August 2026:


What's Next in This Series

This is post #1/5 in the EU AI Act Data Governance Sprint 2026. The remaining posts in this series:

Art.10 compliance is less about technology than about documentation discipline. The goal is not to have perfect data — it is to have documented, examined, and governed data. The August 2026 deadline is sixty days away.


sota.io is an EU-native managed PaaS for deploying high-risk AI systems with the infrastructure controls that Art.12 logging, Art.15 accuracy monitoring, and Art.10 data governance require. Built in Germany, no US parent, no CLOUD Act exposure. Start with sota.io →
