EU AI Act Art.10 Data Governance Finale: Complete Training Data Compliance Checklist Before August 2026
Post #5 (Finale) in the sota.io EU AI Act Data Governance Sprint — August 2026 Deadline
Sixty days from now, high-risk AI systems operating in the EU must be fully compliant with Art.10 of the EU AI Act. The deadline — August 2, 2026 — is not a filing date. It is the date from which non-compliant systems face market surveillance scrutiny, notified body audits, and penalties reaching €15 million or 3% of global annual turnover under Art.99.
This finale post consolidates everything from the four-part sprint into a single, actionable compliance checklist. Use it to audit your current data governance posture, identify gaps, and assign remediation tasks before the deadline arrives.
The Sprint So Far:
- Post 1: Art.10 foundations — what training data documentation the EU AI Act actually requires
- Post 2: Dataset diversity and bias testing — how to audit training data for compliance
- Post 3: Data provenance logging — tracking training data origin and governance records
- Post 4: CI/CD data governance gates — automating compliance checks in your pipeline
What Art.10 Covers: The Full Scope
Art.10 of the EU AI Act governs data and data governance for high-risk AI systems. It applies to providers — organisations that develop or place high-risk AI systems on the EU market — and by extension to the training, validation, and testing datasets those systems depend on.
The article's requirements span six functional areas:
| Area | Article Reference | Core Obligation |
|---|---|---|
| Data governance practices | Art.10(1) | Appropriate management practices for all training, validation, and testing data |
| Data governance and management practices | Art.10(2)(a)–(h) | Eight practice areas covering design choices through data gaps |
| Dataset quality standards | Art.10(3) | Datasets that are relevant, sufficiently representative and, to the best extent possible, free of errors and complete |
| Deployment context | Art.10(4) | Datasets take into account the geographical, contextual, behavioural, or functional setting of intended use |
| Bias monitoring with sensitive data | Art.10(5) | Special category data processing permitted only where strictly necessary for bias detection and correction |
| Systems without model training | Art.10(6) | For high-risk systems not built by training models on data, paragraphs 2 to 5 apply only to testing datasets |
The August 2, 2026 deadline applies to all high-risk AI systems under Annex III of the EU AI Act — including systems in biometric identification, critical infrastructure, education, employment, access to services, law enforcement, migration management, and administration of justice.
The Complete Compliance Checklist
Section 1: Data Governance Framework (Art.10(1))
Governance Structure
- Designated data governance owner (role, not just a name) responsible for Art.10 compliance
- Written data governance policy covering all training, validation, and testing datasets
- Policy version-controlled and reviewed at least annually
- Policy reviewed after any significant dataset update or model version change
- Governance policy linked to your Annex IV technical documentation package
Dataset Registry
- Central registry of all datasets used for training, validation, and testing
- Each dataset entry includes: name, version, format, size, acquisition date, current location
- Registry updated as part of your model release process (not retrospectively)
- Registry accessible to internal auditors and, on request, to competent authorities
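A registry entry with the fields listed above could be modelled roughly as follows. This is a minimal sketch, not a prescribed schema: the `DatasetEntry` class, its field names, the `role` field, and `validate_entry` are all illustrative assumptions.

```python
from dataclasses import dataclass, fields
from datetime import date

# Hypothetical set of dataset roles, mirroring "training, validation, and testing".
ALLOWED_ROLES = {"training", "validation", "testing"}


@dataclass(frozen=True)
class DatasetEntry:
    """One registry row. Field names are illustrative, not mandated by the Act."""
    name: str
    version: str
    data_format: str
    size_bytes: int
    acquired_on: date
    location: str
    role: str  # assumed extra field: which of the three dataset roles this is


def validate_entry(entry: DatasetEntry) -> list[str]:
    """Return a list of problems; an empty list means the entry is complete."""
    problems = []
    for f in fields(entry):
        value = getattr(entry, f.name)
        if value is None or value == "":
            problems.append(f"missing {f.name}")
    if entry.role not in ALLOWED_ROLES:
        problems.append(f"unknown role {entry.role!r}")
    if isinstance(entry.size_bytes, int) and entry.size_bytes <= 0:
        problems.append("size_bytes must be positive")
    return problems
```

Running `validate_entry` as part of the release process, rather than after the fact, is one way to keep the registry from being filled in retrospectively.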
Governance Process Integration
- Data governance sign-off required before any model reaches staging environment
- Data governance review is a blocking step in your CI/CD pipeline (see Post 4)
- Incident response process includes data governance triggers (e.g., discovered bias, corrupted dataset)
Section 2: Documentation Requirements (Art.10(2)(a)–(h))
Art.10(2)(a) — Design Choices
- Document why each dataset was selected over alternatives considered
- Record the design rationale for your train/validation/test split proportions
- Document feature engineering decisions and their data implications
- Record decisions to exclude or downsample any population subgroups with justification
Art.10(2)(b) — Data Collection Processes and Origin
- Full provenance chain for each dataset: origin → collection method → acquisition path → current location
- For personal data: original collection purpose documented and verified as compatible with AI training use
- Third-party datasets: license reviewed, data processing agreement in place if personal data involved
- Web-scraped or crawled data: legal basis documented per GDPR Art.6 or applicable exemption
- Synthetic data: generation methodology documented, validation against real-world distributions recorded
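The provenance items above can be checked mechanically if each dataset carries a structured record. The sketch below assumes a plain dictionary per dataset; every key name (`origin`, `collection_method`, `dpa_in_place`, and so on) is a hypothetical choice for illustration, not a format defined in this series or in the Act.

```python
# Stages of the chain: origin -> collection method -> acquisition path -> current location.
REQUIRED_STAGES = ("origin", "collection_method", "acquisition_path", "current_location")


def provenance_gaps(record: dict) -> list[str]:
    """Return the names of missing provenance fields for one dataset record."""
    gaps = [stage for stage in REQUIRED_STAGES if not record.get(stage)]

    # Personal data needs the original collection purpose and a documented legal basis.
    if record.get("contains_personal_data"):
        for key in ("original_purpose", "legal_basis"):
            if not record.get(key):
                gaps.append(key)

    # Third-party data needs a reviewed license, plus a DPA when personal data is involved.
    if record.get("third_party"):
        if not record.get("license_reviewed"):
            gaps.append("license_reviewed")
        if record.get("contains_personal_data") and not record.get("dpa_in_place"):
            gaps.append("dpa_in_place")
    return gaps
```

An empty result means the record covers every stage; anything else is a list you can hand directly to the dataset owner.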
Art.10(2)(c) — Data Preparation Operations
- All annotation, labelling, and tagging procedures documented with version history
- Annotator qualification criteria and inter-annotator agreement scores recorded
- Cleaning operations documented: what was removed, why, and how much of the dataset was affected
- Enrichment procedures: what data was added, from what source, under what legal basis
- Aggregation logic documented including join keys, deduplication rules, and merge methodology
- Transformation pipelines stored as version-controlled code (not undocumented scripts)
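For the inter-annotator agreement score mentioned above, Cohen's kappa is one common choice for two annotators. The checklist does not prescribe a metric, so treat the choice of kappa and the function name `cohens_kappa` as assumptions.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)

    # Observed agreement: share of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement by chance, from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1 - expected)
```

Recording the score next to the annotation guideline version makes it possible to show later which labelling round a given agreement figure belongs to.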
Art.10(2)(d) — Assumptions
- Written documentation of all significant assumptions made about the dataset and its representativeness
- Assumptions about real-world distribution vs. dataset distribution explicitly stated
- Temporal assumptions documented (e.g., "dataset reflects conditions as of Q3 2024")
- Geographic scope assumptions recorded (e.g., which EU member states are represented)
- Assumptions flagged when they materially affect the system's expected performance on subgroups
Art.10(2)(e) — Availability, Quantity, and Suitability Assessment
- Formal suitability assessment completed for each dataset prior to model training
- Minimum dataset size thresholds defined and met for each use case
- Assessment covers whether dataset volume is sufficient for the system's intended deployment context
- Reassessment triggered when deployment context changes or dataset ages significantly
- Assessment document signed off by someone with authority over the training process
Art.10(2)(f)–(g) — Bias Examination and Mitigation
- Bias assessment conducted across all protected characteristics relevant to your use case
- Protected characteristics examined: age, disability, gender, race/ethnicity, religion, sexual orientation (minimum)
- Bias metrics recorded: disparate impact ratios, false positive/negative rate differentials by group
- Bias testing conducted on validation set AND on held-out test set separately
- Bias remediation actions documented: what was changed, what effect was measured
- Residual bias acknowledged where not fully mitigated, with documented risk acceptance decision
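The bias metrics named above (disparate impact ratios and false positive/negative rate differentials by group) can be computed with a few lines of plain Python. This is a rough sketch for binary labels and predictions; the function names and the output layout are assumptions, and in practice you would likely use an established fairness library.

```python
def group_rates(y_true, y_pred, groups):
    """Per-group selection rate, false positive rate, and false negative rate.

    y_true and y_pred are sequences of 0/1; groups holds the group label per record.
    """
    stats = {}
    for truth, pred, group in zip(y_true, y_pred, groups):
        s = stats.setdefault(group, {"n": 0, "selected": 0, "pos": 0, "neg": 0, "fp": 0, "fn": 0})
        s["n"] += 1
        s["selected"] += pred
        if truth == 1:
            s["pos"] += 1
            s["fn"] += int(pred == 0)
        else:
            s["neg"] += 1
            s["fp"] += int(pred == 1)
    return {
        group: {
            "selection_rate": s["selected"] / s["n"],
            # Rates are None when a group has no negatives or positives to measure against.
            "fpr": s["fp"] / s["neg"] if s["neg"] else None,
            "fnr": s["fn"] / s["pos"] if s["pos"] else None,
        }
        for group, s in stats.items()
    }


def disparate_impact(rates, group, reference):
    """Selection rate of `group` divided by that of `reference` (None if undefined)."""
    ref = rates[reference]["selection_rate"]
    return rates[group]["selection_rate"] / ref if ref else None


def rate_differential(rates, metric, group, reference):
    """Difference in "fpr" or "fnr" between `group` and `reference` (None if undefined)."""
    a, b = rates[group][metric], rates[reference][metric]
    return None if a is None or b is None else a - b
```

Calling these once on the validation set and once on the held-out test set, and storing both results, gives you the separate records the checklist asks for.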
Art.10(2)(h) — Data Gaps and Shortcomings
- Gap analysis completed: which populations, scenarios, or conditions are underrepresented?
- Each identified gap assigned a severity classification (critical / significant / acceptable)
- Mitigation strategy documented for critical and significant gaps
- Known shortcomings included in your Art.13 transparency documentation to deployers
- Gap register reviewed and updated before each major model version release
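A gap register with the three severity levels above can be kept as structured data so that unclassified or unmitigated gaps are easy to spot before a release. The `Gap` class and `release_blockers` function below are hypothetical names in a minimal sketch of that idea.

```python
from dataclasses import dataclass
from typing import Optional

SEVERITIES = ("critical", "significant", "acceptable")


@dataclass
class Gap:
    """One entry in the gap register."""
    description: str
    severity: Optional[str] = None    # one of SEVERITIES once classified
    mitigation: Optional[str] = None  # expected for critical and significant gaps


def release_blockers(gaps):
    """List gaps that are unclassified, or critical/significant without a mitigation."""
    blockers = []
    for gap in gaps:
        if gap.severity not in SEVERITIES:
            blockers.append(f"unclassified: {gap.description}")
        elif gap.severity != "acceptable" and not gap.mitigation:
            blockers.append(f"no mitigation ({gap.severity}): {gap.description}")
    return blockers
```

An empty list from `release_blockers` is a reasonable precondition for signing off a major model version.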
Section 3: Dataset Quality Standards (Art.10(3))
Relevance
- Documented rationale for why each dataset is relevant to your specific AI system task
- Relevance confirmed against the system's defined purpose in your Art.11 technical documentation
- Relevance re-verified when the system's intended use expands or changes
Representativeness
- Demographic distribution analysis for all identity-relevant datasets
- Geographic coverage documented relative to intended deployment geography
- Temporal coverage documented, with a defined maximum dataset age relative to the deployment context and no dataset exceeding it
- Representation gaps documented as Art.10(2)(h) shortcomings (see above)
Freedom from Errors
- Error detection process applied to all datasets prior to training
- Error types documented: label errors, duplicate records, outliers, corrupted entries
- Error rates recorded: percentage of dataset affected
- Error rate thresholds defined — training proceeds only if error rate is below threshold
- Systematic errors (not random) trigger investigation before training proceeds
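Recording error rates and enforcing a threshold could look roughly like this. The sketch covers only two of the error types listed above (duplicate records and corrupted entries, here taken to mean rows with a required field missing); the function names and the threshold values in the usage note are assumptions you would replace with your own.

```python
def error_rates(rows, required_fields):
    """Share of rows that are exact duplicates or lack a required field.

    rows is a list of flat dicts with hashable values.
    """
    total = len(rows)
    if total == 0:
        raise ValueError("dataset is empty")
    seen = set()
    duplicates = 0
    corrupted = 0
    for row in rows:
        fingerprint = tuple(sorted(row.items()))  # keys are unique, so values are never compared
        if fingerprint in seen:
            duplicates += 1
        else:
            seen.add(fingerprint)
        if any(row.get(field) is None for field in required_fields):
            corrupted += 1
    return {"duplicate": duplicates / total, "corrupted": corrupted / total}


def may_train(rates, thresholds):
    """Training proceeds only if every measured rate is strictly below its threshold."""
    return all(rates[name] < limit for name, limit in thresholds.items())
```

For example, `may_train(error_rates(rows, ["label"]), {"duplicate": 0.01, "corrupted": 0.02})` returns `False` as soon as either rate reaches its limit. Storing the returned rates before and after cleaning gives you the recorded figures an auditor will ask for.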
Completeness
- Missing value analysis completed for all features used in training
- Missing value handling strategy documented: imputation method, exclusion, or flagging
- High-missingness features (>5% missing) flagged for explicit review before training
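The missing value analysis and the 5% review flag above reduce to a small calculation. The sketch below assumes rows are dictionaries and that a missing value is represented as `None` or an absent key; both function names are illustrative.

```python
def missingness(rows, features):
    """Fraction of rows in which each feature is missing (None or key absent)."""
    total = len(rows)
    if total == 0:
        raise ValueError("dataset is empty")
    return {
        feature: sum(1 for row in rows if row.get(feature) is None) / total
        for feature in features
    }


def flag_high_missingness(rates, threshold=0.05):
    """Features whose missing rate exceeds the threshold, sorted by name."""
    return sorted(feature for feature, rate in rates.items() if rate > threshold)
```

The flagged list is an input to the explicit review step; the chosen handling strategy (imputation, exclusion, or flagging) still needs to be written down per feature.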
Section 4: Sensitive Data for Bias Monitoring (Art.10(5))
Art.10(5) creates a narrow exception: providers may process special category personal data under GDPR Art.9 only where strictly necessary for bias detection and correction in their high-risk AI system.
- If using Art.10(5): legal basis documented as this specific exception
- Sensitive data processed for bias purposes is stored separately from training data
- Access controls restrict sensitive bias-testing data to authorised personnel only
- Data minimisation principle applied: only the minimum necessary sensitive data processed
- Retention period defined and data deleted or anonymised after bias testing is complete
- Art.10(5) processing documented in your GDPR Record of Processing Activities (RoPA)
- DPIA completed if the bias testing involves large-scale processing of special category data
Section 5: CI/CD Integration Verification (Cross-Reference Post 4)
These items verify that your Art.10 compliance is automated and enforced at the pipeline level:
- Gate 1 (Provenance): Pipeline blocks if any dataset lacks a complete provenance record
- Gate 2 (Documentation): Pipeline blocks if Art.10(2)(a)–(h) documentation is incomplete
- Gate 3 (Bias Testing): Pipeline blocks if bias assessment has not been completed for this dataset version
- Gate 4 (Gap Registry): Pipeline blocks if new gaps have been identified but not classified
- Gate 5 (Statistical Properties): Pipeline blocks if dataset fails defined quality thresholds
- Compliance certificate auto-generated and committed to your audit log on each successful gate pass
- Failed gate outputs are retained as evidence (not silently discarded)
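The gate behaviour above (block on failure, keep the evidence either way) can be sketched as a small runner. This is not the implementation from Post 4: `run_gates`, the shape of a gate as a `(name, check)` pair, and the `context` dictionary are assumptions made for illustration.

```python
from datetime import datetime, timezone


def run_gates(gates, context, audit_log):
    """Run every gate, append each result to audit_log, and return True only if all pass.

    gates is a list of (name, check) pairs; check(context) returns a list of
    failure messages, with an empty list meaning the gate passed.
    """
    all_passed = True
    for name, check in gates:
        failures = check(context)
        # Record passes and failures alike so failed runs remain available as evidence.
        audit_log.append({
            "gate": name,
            "passed": not failures,
            "failures": failures,
            "checked_at": datetime.now(timezone.utc).isoformat(),
        })
        all_passed = all_passed and not failures
    return all_passed
```

In a pipeline step you would typically exit with a non-zero status when `run_gates` returns `False`, and persist `audit_log` to wherever your audit trail lives. Running every gate instead of stopping at the first failure is a deliberate choice here: it gives the team the full list of problems in one run.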
Section 6: Audit Readiness
These items ensure you can respond to a competent authority inquiry within a reasonable timeframe:
- All Art.10 documentation consolidated in a single retrievable location (not spread across team wikis)
- Documentation timestamped and version-controlled — auditors will ask for the state at a specific model version
- Named individual who can present Art.10 documentation to auditors on short notice
- Evidence of Art.10 compliance included in or linked from your Annex IV technical documentation package
- Internal audit of Art.10 compliance completed at least once before August 2, 2026
- Remediation log maintained — if you found and fixed gaps, the audit trail shows you found them proactively
Scoring Your Current Posture
Score each section against your current state:
Section 1: Data Governance Framework ___/12 items
Section 2: Documentation Requirements ___/36 items
Section 3: Dataset Quality Standards ___/15 items
Section 4: Sensitive Data (if applicable) ___/7 items
Section 5: CI/CD Integration ___/7 items
Section 6: Audit Readiness ___/6 items
___/83 total
Interpretation:
- 74–83 (89%+): Audit-ready. Minor gaps only; assign owners and close them before July 15.
- 58–73 (70–88%): On track but needs acceleration. Identify your largest incomplete section and sprint it in the next 4 weeks.
- 42–57 (50–69%): Significant gaps. Prioritise Section 2 documentation and Section 5 CI/CD gates, which have the longest implementation lead times.
- Below 42 (<50%): Critical state. Engage a compliance consultant alongside your engineering effort. You cannot close this gap by August 2 without a structured programme.
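If you track the checklist in a spreadsheet or script, the percentage bands above translate into a few lines. The sketch below works from percentages rather than fixed item counts, so you can leave out Section 4 when it does not apply; `posture` and the band labels are illustrative names.

```python
# (minimum percentage, label), checked from the highest band down.
BANDS = [(89, "audit-ready"), (70, "on track"), (50, "significant gaps")]


def posture(checked, total):
    """Return (rounded percentage, band label) for `checked` items out of `total`."""
    if total <= 0 or not 0 <= checked <= total:
        raise ValueError("need 0 <= checked <= total and total > 0")
    pct = (200 * checked + total) // (2 * total)  # integer round-half-up
    for floor, label in BANDS:
        if pct >= floor:
            return pct, label
    return pct, "critical"
```

Pass in the number of items you have ticked and the number of items that apply to your system; the label tells you which of the four interpretations to read.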
The 10 Highest-Risk Gaps (What Auditors Look for First)
Based on the audit patterns that have emerged from early AI Act enforcement guidance and analogous GDPR enforcement decisions:
- No provenance chain for third-party datasets — "We licensed it" is not sufficient without the collection method and original purpose documented
- Bias testing only on the validation set, not the held-out test set — Art.10(2)(f) applies to training, validation, and testing datasets alike, so validation-set bias results alone do not cover this
- Undocumented cleaning operations — If you dropped 15% of your dataset to remove outliers and didn't document why, that is a gap
- Assumptions documented nowhere — Temporal and geographic assumptions are the most commonly missing
- No gap register — Art.10(2)(h) requires that you identify gaps and how to address them; asserting that none exist is not enough
- Art.10(5) data mixed with training data — Sensitive data processed for bias detection must be kept strictly separate
- Design choices not version-controlled — If you can't show what choices were made for a specific model version, you cannot demonstrate compliance at audit time
- No designated governance owner — "The team is responsible" does not satisfy the governance framework obligation
- Error rates not measured — "We cleaned the data" without recorded error rates before and after is not sufficient
- Art.10 documentation not linked to Annex IV — Competent authorities review technical documentation first; if Art.10 evidence is in a separate system with no pointer, it may as well not exist
Implementation Timeline: June 3 to August 2, 2026
June 3–15 (13 days): Audit current state using this checklist. Score each section.
Assign owners to each gap. Triage: Critical / Important / Nice-to-have.
June 16–30 (15 days): Close Critical gaps. Priority order:
1. Governance owner designation
2. Dataset registry creation
3. Provenance documentation for all training datasets
4. Bias testing with recorded metrics
July 1–15 (15 days): Close Important gaps:
1. Art.10(2)(a)–(h) documentation for all datasets
2. CI/CD gate implementation
3. Error rate measurement and recording
July 16–25 (10 days): Internal compliance audit.
Simulate a competent authority documentation request.
Fix remaining gaps identified.
July 26–31 (6 days): Final review. Lock documentation versions.
Confirm technical documentation package is complete.
August 2, 2026: Deadline. System must be compliant from this date.
Quick-Reference: Key Article Numbers for Art.10
For your documentation and during auditor discussions:
| Reference | What It Covers |
|---|---|
| Art.10(1) | General data governance obligations |
| Art.10(2)(a) | Design choice documentation |
| Art.10(2)(b) | Data collection processes and origin |
| Art.10(2)(c) | Data preparation operations |
| Art.10(2)(d) | Relevant assumptions |
| Art.10(2)(e) | Dataset availability, quantity, suitability |
| Art.10(2)(f) | Bias examination |
| Art.10(2)(g) | Bias detection, prevention, and mitigation measures |
| Art.10(2)(h) | Data gaps and shortcomings |
| Art.10(3) | Dataset quality: relevance, representativeness, freedom from errors, completeness |
| Art.10(4) | Geographical, contextual, behavioural, or functional setting of intended use |
| Art.10(5) | Sensitive data exception for bias detection and correction |
| Art.10(6) | Systems not trained on data: paragraphs 2 to 5 apply only to testing datasets |
| Art.11 | Technical documentation (links to Art.10 evidence) |
| Art.17 | Quality management system (governance framework home) |
| Art.99(4) | Penalties: €15M or 3% global annual turnover |
Closing the Sprint
This five-post sprint covered the complete Art.10 compliance lifecycle — from understanding what the article requires, through bias testing methodology, provenance logging architecture, CI/CD gate implementation, and now this consolidated checklist.
The August 2 deadline is fixed. The obligations are specific. The checklist above gives you a concrete audit surface: a set of items that, when checked, represents a defensible Art.10 compliance posture.
The organisations that will face enforcement action in the first wave after August 2 are not primarily the ones that tried and fell slightly short — they are the ones that have no documentation at all, no governance owner, no record of ever having considered Art.10. If you have completed this sprint, you are already in a significantly better position than that.
This post is part of the sota.io EU AI Act Compliance Series. Related: Art.10 Data Governance Foundations — Dataset Diversity and Bias Testing — Provenance Logging — CI/CD Data Governance Gates