EU AI Act Art.10 Data Governance: Training Data Documentation Requirements for High-Risk AI (August 2026)
Post #1 in the EU AI Act Data Governance Sprint 2026 — 5-part series on Art.10 compliance for high-risk AI providers
Sixty days before the August 2026 EU AI Act compliance deadline, most high-risk AI providers have addressed the visible obligations: risk management systems (Art.9), logging infrastructure (Art.12), transparency notices (Art.13). What many have not addressed is Art.10 — data and data governance.
Art.10 is operationally demanding in a way that differs from other Articles. It does not ask you to implement a system. It asks you to document choices you already made about your training data — and to have done so in a way that survives a National Competent Authority (NCA) audit. If you trained your model two years ago and have no contemporaneous data governance records, you have an Art.10 problem.
This post covers exactly what documentation Art.10 requires, why the documentation bar is higher than most developers expect, and how to reconstruct or create the required artefacts before August 2026.
What Art.10 Requires: The Five Core Obligations
Art.10 of the EU AI Act establishes data and data governance requirements for high-risk AI systems. Unlike the AI Act's conformity assessment provisions (Art.43), Art.10 obligations apply continuously — not just at the point of market placement.
The five core obligations are:
1. Data Governance and Management Practices
Providers must subject training, validation, and testing datasets to appropriate data governance and management practices. This is a principles-based obligation — the Act does not specify a particular framework — but the NCA audit will ask you to demonstrate what your governance practices were. A retrospective claim that "we followed best practices" without records is insufficient.
What this means in practice:
- A documented data management policy that was in place during dataset curation
- Records showing who had access to training data and under what conditions
- Version control for datasets (or equivalent provenance records)
- Documented decision points: why a particular dataset was selected, why another was excluded
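One way to make the "version control or equivalent provenance records" point concrete is to record a content fingerprint for each dataset snapshot. The sketch below is a minimal illustration using only the Python standard library; the function name `dataset_fingerprint` and the directory-per-dataset layout are assumptions for the example, not anything the Act prescribes.

```python
import hashlib
from pathlib import Path


def dataset_fingerprint(root: Path) -> str:
    """Return a SHA-256 digest covering every file under `root`.

    Files are visited in sorted order so the digest is stable across runs.
    Both the relative path and the file bytes feed the hash, so a rename
    or a content change produces a different fingerprint.
    """
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")  # separator between path and content
        with path.open("rb") as handle:
            # Read in 1 MiB chunks so large files do not need to fit in memory.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
    return digest.hexdigest()
```

Storing the resulting hex string alongside the curation date and the responsible person gives a later reviewer a way to check that the dataset on disk is the one the records describe.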
2. Training Data Quality Criteria
The Act requires training, validation, and testing datasets to be:
- Relevant to the task the AI system performs
- Sufficiently representative of the population or context the system will operate in
- Free of errors, to the best extent possible
- Complete in view of the intended purpose, again to the best extent possible: not missing categories of data that are material to the system's function
Each criterion requires documentation. "We believe the data was representative" is not evidence. An NCA will look for:
- A written description of how representativeness was assessed
- Statistics on class balance, demographic coverage, or domain coverage (depending on the use case)
- Any known data gaps and the rationale for proceeding despite them
- Error detection procedures applied to the dataset and their results
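The statistics an NCA might ask for can often be produced with very little code. The following is a rough sketch, assuming a dataset small enough to hold in memory as a list of dictionaries; the function name `quality_summary`, the sample rows, and the choice of checks (empty values, exact duplicates, class balance) are illustrative assumptions rather than a required method.

```python
from collections import Counter
from typing import Dict, List, Optional


def quality_summary(rows: List[Dict[str, Optional[str]]], label_key: str) -> dict:
    """Summarise missing values, exact duplicate rows, and class balance."""
    missing_values = sum(
        1 for row in rows for value in row.values() if value in (None, "")
    )

    seen = set()
    duplicate_rows = 0
    for row in rows:
        key = tuple(sorted(row.items()))  # order-independent row identity
        if key in seen:
            duplicate_rows += 1
        else:
            seen.add(key)

    # Rows without a label are excluded from the balance calculation.
    labels = [row[label_key] for row in rows if row.get(label_key)]
    counts = Counter(labels)
    balance = {label: count / len(labels) for label, count in counts.items()}

    return {
        "total_rows": len(rows),
        "missing_values": missing_values,
        "duplicate_rows": duplicate_rows,
        "class_balance": balance,
    }


# Hypothetical sample data, for illustration only.
sample_rows = [
    {"text": "loan for car", "label": "approve"},
    {"text": "loan for car", "label": "approve"},
    {"text": "loan for boat", "label": "reject"},
    {"text": "", "label": "approve"},
]
summary = quality_summary(sample_rows, label_key="label")
```

The point of the sketch is the output record, not the specific checks: whatever error detection you run, its results should end up in a dated, stored summary rather than in a notebook cell.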
3. Statistical Properties Appropriate to Intended Use
The dataset must have statistical properties appropriate for the specific geographical, contextual, behavioural, or functional setting in which the system is intended to be used. This is a localisation requirement that catches a common failure: training on global or US-market data and deploying in the EU without adjustment.
Required documentation:
- The intended deployment context (geography, user population, operational environment)
- Statistical analysis showing the training data matches this context
- Any sub-population analysis (if the system is used differently across user segments, each segment's data coverage must be documented)
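A simple way to put a number on "the training data matches this context" is to compare the category shares in the training set with the shares expected in deployment. The sketch below uses total variation distance as one possible measure; the metric choice, the function names, and the country figures are assumptions made for illustration, and the Act does not prescribe any particular statistic.

```python
from typing import Dict, List, Mapping


def shares(counts: Mapping[str, int]) -> Dict[str, float]:
    """Convert raw counts per category into fractions of the total."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


def total_variation_distance(p: Mapping[str, float], q: Mapping[str, float]) -> float:
    """Half the sum of absolute share differences: 0 is identical, 1 is disjoint."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def uncovered(p: Mapping[str, float], q: Mapping[str, float]) -> List[str]:
    """Categories expected in deployment (q) that are absent from training (p)."""
    return sorted(k for k, share in q.items() if share > 0 and p.get(k, 0.0) == 0)


# Hypothetical figures: training data by country vs. expected deployment mix.
training = shares({"DE": 500, "FR": 500})
deployment = {"DE": 0.5, "FR": 0.25, "IT": 0.25}

distance = total_variation_distance(training, deployment)
gaps = uncovered(training, deployment)
```

In this made-up example the distance is 0.25 and Italy is flagged as uncovered, which is exactly the kind of known gap that belongs in the documentation along with the rationale for proceeding.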
4. Bias Examination
Art.10 requires providers to examine training, validation, and testing datasets for possible biases that are likely to affect the health and safety of persons, have a negative impact on fundamental rights, or lead to discrimination prohibited under Union law (Art.10(2)(f)). This includes biases related to protected characteristics under EU fundamental rights law.
The examination must be:
- Documented: what was tested, when, by whom
- Methodologically described: what bias metrics were used and why they were chosen
- Results-recorded: what biases were found (including negative results — "no bias detected" must be backed by the test records, not just the assertion)
- Linked to mitigations: if biases were found, what was done? Resampling, reweighting, data augmentation, model adjustments?
Unexamined data is not compliant data. The obligation is to examine, not merely to avoid bias.
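As a minimal illustration of a documented, metric-based examination, the sketch below computes the positive-label rate per group in a dataset and the gap between the highest and lowest rate (a dataset-level analogue of demographic parity). The metric, the field names, the sample records, and the shape of the `examination_record` dictionary are all assumptions for the example; a real examination would choose metrics suited to the use case and justify that choice.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def positive_rate_by_group(
    records: Iterable[Mapping[str, str]],
    group_key: str,
    label_key: str,
    positive_label: str,
) -> Dict[str, float]:
    """Fraction of records carrying the positive label, per group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for record in records:
        group = record[group_key]
        totals[group] += 1
        if record[label_key] == positive_label:
            positives[group] += 1
    return {group: positives[group] / totals[group] for group in totals}


def parity_gap(rates: Mapping[str, float]) -> float:
    """Difference between the highest and lowest group rate."""
    return max(rates.values()) - min(rates.values())


# Hypothetical records, for illustration only.
records = (
    [{"group": "A", "label": "approve"}] * 3
    + [{"group": "A", "label": "reject"}]
    + [{"group": "B", "label": "approve"}]
    + [{"group": "B", "label": "reject"}] * 3
)
rates = positive_rate_by_group(records, "group", "label", "approve")

# What was tested, when, by whom, and with what result (placeholder values).
examination_record = {
    "metric": "positive-label rate gap between groups",
    "protected_attribute": "group",
    "examined_on": "YYYY-MM-DD",
    "examined_by": "name/role",
    "rates": rates,
    "gap": parity_gap(rates),
}
```

Keeping the rates and the gap in the record matters even when the gap is small: a "no bias detected" conclusion is only as strong as the stored numbers behind it.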
5. Special Categories Documentation
Where the examination for biases requires processing of special categories of personal data (for example, processing racial or ethnic origin data to detect demographic bias), Art.10(5) permits it exceptionally, to the extent strictly necessary for bias detection and correction and subject to additional safeguards. This processing must be documented with appropriate data protection controls and must be limited strictly to that purpose.
If your bias examination required any sensitive data categories, your documentation must cover:
- The legal basis used
- The data minimisation measures applied
- The retention and deletion timeline for sensitive data used in bias testing
The Documentation Gap: Why Most Teams Are Not Ready
The most common Art.10 failure pattern is not that providers trained models on bad data; it is that providers cannot prove they trained models on good data.
Consider the typical ML team workflow circa 2023–2024:
- Data scientist sources training data from a mix of internal systems and external datasets
- Preprocessing and cleaning happens in notebooks with limited version control
- Model training occurs and parameters are logged in MLflow or Weights & Biases
- Model is evaluated on a validation set
- Model is deployed
What this produces: excellent model metrics, a reproducible training pipeline, and virtually no Art.10-compliant documentation. MLflow logs capture training runs, not data governance choices. The rationale for dataset selection exists only in the data scientist's memory or in informal Slack conversations.
The documentation that Art.10 requires must exist before the system is placed on the market or put into service. Reconstruction is legally permissible (the Act does not require contemporaneous records — it requires the records to exist before market placement) but it must accurately reflect actual decisions, not post-hoc rationalisations.
Required Documentation Artefacts
Dataset Card (Mandatory)
A dataset card is the primary Art.10 documentation artefact. It should cover:
## Dataset: [Name and Version]
### Purpose and Task Context
- Intended AI task: [classification/regression/generation/etc.]
- Deployment context: [geography, sector, user population]
- Why this dataset was selected over alternatives
### Coverage and Representativeness
- Total samples: [N]
- Time range of data: [from → to]
- Geographic coverage: [countries/regions]
- Demographic coverage: [if applicable — age, gender, ethnicity distribution]
- Known gaps: [what the dataset does not cover and why]
### Quality Assessment
- Error detection methods applied: [describe]
- Error rate found: [X%]
- Errors removed: [Y samples removed for reason Z]
- Residual error estimate: [%]
### Data Lineage
- Primary sources: [source 1, source 2]
- Preprocessing steps: [step 1, step 2]
- Dataset version: [hash or version identifier]
- Curation date: [YYYY-MM-DD]
- Responsible team/individual: [name/role]
### Bias Examination
- Bias metrics used: [demographic parity, equalised odds, etc.]
- Protected attributes examined: [age, gender, nationality, etc.]
- Examination date: [YYYY-MM-DD]
- Results: [findings, including negative results]
- Mitigations applied: [if any]
### Special Categories Processing (if applicable)
- Sensitive data categories processed for bias examination: [specify]
- Legal basis: [Art.10(5) AI Act together with the applicable Art.9(2) GDPR condition, e.g. substantial public interest]
- Data minimisation measures: [describe]
- Retention: [delete after bias examination, on [date]]
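If dataset cards are kept in a structured form, finding the gaps can be automated. The sketch below assumes each card is loaded as a nested dictionary; the section and field names in `REQUIRED_FIELDS` are hypothetical keys loosely mirroring the template above, and the optional sections (demographic coverage, special categories) are deliberately left out of the required set.

```python
from typing import Dict, List

# Hypothetical keys that mirror the template sections; adapt to your own schema.
REQUIRED_FIELDS: Dict[str, List[str]] = {
    "purpose": ["intended_task", "deployment_context", "selection_rationale"],
    "coverage": ["total_samples", "time_range", "geographic_coverage", "known_gaps"],
    "quality": [
        "error_detection_methods",
        "error_rate",
        "errors_removed",
        "residual_error_estimate",
    ],
    "lineage": [
        "primary_sources",
        "preprocessing_steps",
        "dataset_version",
        "curation_date",
        "responsible",
    ],
    "bias_examination": [
        "metrics",
        "protected_attributes",
        "examination_date",
        "results",
        "mitigations",
    ],
}


def missing_fields(card: dict) -> List[str]:
    """List every required 'section.field' that is absent or left empty."""
    gaps = []
    for section, fields in REQUIRED_FIELDS.items():
        values = card.get(section, {})
        for field in fields:
            # A numeric 0 (e.g. an error rate of 0.0) counts as filled in.
            if values.get(field) in (None, "", [], {}):
                gaps.append(f"{section}.{field}")
    return gaps


# Hypothetical, mostly empty card to show the output shape.
draft_card = {
    "purpose": {
        "intended_task": "classification",
        "deployment_context": "",
        "selection_rationale": "only licensed source covering the target sector",
    }
}
card_gaps = missing_fields(draft_card)
```

A check like this only confirms that a field has content; whether the content is accurate and sufficient still needs human review.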
Data Governance Policy (Mandatory)
A written policy describing:
- Who is responsible for data quality decisions
- What data sources are permissible and impermissible
- How third-party dataset licences are evaluated
- How data version control is maintained
- How dataset updates are propagated to the model
This policy need not be complex (two pages is sufficient), but it must exist before the system is placed on the market and must accurately describe the practice actually followed during the data curation period.
Bias Examination Report (Mandatory)
Separate from the dataset card, a standalone report covering:
- Scope of the examination (which datasets, which protected attributes)
- Methodology (specific metrics, statistical tests, tools used)
- Results with supporting statistics
- Decision rationale: if biases were found and mitigations were applied, why was the chosen mitigation selected? If biases were found and no mitigation was applied, what was the justification?
The NCA audit may focus specifically on this report because bias in high-risk AI systems is a primary regulatory concern.
Data-Model Linkage Record
A record that links specific dataset versions to specific model versions. This prevents the common situation where multiple model versions were trained on variants of a dataset and it becomes impossible to reconstruct which data produced which deployed model.
At minimum: a table of [model_version, dataset_version, training_date, dataset_card_reference].
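The four-column table can live in something as simple as a CSV file under version control. The sketch below is one possible shape, using the column names from the text; the class name `LinkageRow`, the helper functions, and all sample values are hypothetical.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import Iterable, List


@dataclass(frozen=True)
class LinkageRow:
    model_version: str
    dataset_version: str
    training_date: str  # ISO date, YYYY-MM-DD
    dataset_card_reference: str


def to_csv(rows: Iterable[LinkageRow]) -> str:
    """Serialise linkage rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=[f.name for f in fields(LinkageRow)],
        lineterminator="\n",
    )
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buffer.getvalue()


def datasets_for_model(rows: Iterable[LinkageRow], model_version: str) -> List[LinkageRow]:
    """Return every dataset entry recorded for the given model version."""
    return [row for row in rows if row.model_version == model_version]


# Hypothetical entries, for illustration only.
linkage = [
    LinkageRow("scorer-1.0.0", "sha256:aaaa", "2024-11-02", "cards/applications-v1.md"),
    LinkageRow("scorer-1.1.0", "sha256:bbbb", "2025-03-01", "cards/applications-v2.md"),
]
csv_text = to_csv(linkage)
deployed = datasets_for_model(linkage, "scorer-1.1.0")
```

Using a content hash as the dataset version, rather than a free-text label, makes it harder for two different dataset variants to end up under the same name.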
How Art.10 Connects to Other Articles
Art.10 does not stand alone. Its documentation feeds directly into other compliance requirements:
Annex IV (Technical Documentation): The technical documentation that providers must maintain under Art.11 and Annex IV explicitly requires information about training data, including validation and testing methodologies. Your Art.10 artefacts are inputs to your Annex IV technical file.
Art.9 (Risk Management System): The RMS must account for risks that could arise from training data — including risks from biased data. Your bias examination report under Art.10 should be referenced in your RMS. An RMS that does not address data risks is incomplete.
Art.12 (Record-Keeping): Art.12 requires automatic logging of events relevant to identifying risk situations and to post-market monitoring, which is where data quality issues discovered after deployment tend to surface. Art.10 documentation defines the baseline; Art.12 logs capture deviations from that baseline discovered in production.
Art.15 (Accuracy, Robustness, Cybersecurity): Performance metrics declared under Art.15 must be substantiated by the validation dataset referenced in your Art.10 documentation. If your accuracy claims rest on a validation set that was not properly documented, your Art.15 declaration is unsupported.
August 2026 Readiness: What to Do in the Next 60 Days
If your training data documentation is incomplete, the practical remediation path is:
Week 1–2: Audit existing documentation
- Inventory all datasets used in training, validation, and testing
- Identify what documentation already exists (MLflow logs, data source records, preprocessing notebooks)
- Map the gaps against the dataset card template above
Week 3–4: Reconstruct documentation
- Produce dataset cards for all training datasets, using available evidence (data source records, preprocessing scripts, evaluation logs)
- Reconstruct the bias examination record from whatever bias testing was done, even informally
- If no bias examination was done: conduct one now and document it properly
Week 5–6: Formalise governance
- Write the data governance policy (the policy can be written now and can acknowledge that it codifies previous practice)
- Establish the data-model linkage record
- Ensure future model updates will produce contemporaneous documentation
Week 7–8: Review and integrate
- Cross-reference Art.10 documentation with Annex IV technical file
- Ensure RMS references data risks documented in bias examination report
- Legal review of bias examination special-categories processing (if applicable)
What NCAs Will Look For in an Audit
Based on the regulatory framework and the guidance published by the European AI Office, NCA data audits for high-risk AI are expected to focus on:
- Documentation existence and completeness: Can the provider produce the records? Are they comprehensive or superficial?
- Documentation contemporaneity: Were records created when decisions were made, or are they clearly post-hoc? (Reconstruction is not prohibited but obvious post-hoc rationalisation is a red flag.)
- Bias examination depth: Was the examination systematic and methodology-driven, or was it cursory?
- Linkage between data and deployment: Can the provider trace which specific dataset version is behind the currently deployed model?
- Special categories handling: If sensitive data was processed for bias examination, was it handled in compliance with GDPR Art.9 requirements?
Compliance Checklist: Art.10 Data Governance
Use this before August 2026:
- Dataset card exists for every training, validation, and testing dataset
- Dataset card covers: purpose, coverage, quality assessment, data lineage, bias examination, and (if applicable) special categories processing
- Data governance policy is written and applies to all future data curation
- Data-model linkage record links current deployed model to specific dataset versions
- Bias examination report exists with documented methodology, results, and mitigations
- Bias examination report is referenced in the Art.9 Risk Management System
- Art.10 documentation is part of the Annex IV technical file
- Retention and deletion timeline for any sensitive data used in bias testing is documented
- Data governance documentation is stored in a durable, audit-accessible location (not just developer laptops or informal wikis)
- Documentation covers all datasets used across the model lifecycle, not just the final training run
What's Next in This Series
This is post #1/5 in the EU AI Act Data Governance Sprint 2026. The remaining posts in this series:
- Post #2: Dataset Diversity and Bias Testing — how to audit training data for EU AI Act compliance, what bias metrics to use, and how to document results
- Post #3: Data Provenance Logging — tracking training data origin, transformations, and governance records across the model lifecycle
- Post #4: Data Governance CI/CD Gates — automated training data compliance checks for high-risk AI pipelines
- Post #5: Art.10 Data Governance Finale — complete training data compliance checklist before August 2026
Art.10 compliance is less about technology than about documentation discipline. The goal is not to have perfect data — it is to have documented, examined, and governed data. The August 2026 deadline is sixty days away.
sota.io is an EU-native managed PaaS for deploying high-risk AI systems with the infrastructure controls that Art.12 logging, Art.15 accuracy monitoring, and Art.10 data governance require. Built in Germany, no US parent, no CLOUD Act exposure. Start with sota.io →