2026-06-03·5 min read·sota.io Team

EU AI Act Art.10 Data Governance Finale: Complete Training Data Compliance Checklist Before August 2026

Post #5 (Finale) in the sota.io EU AI Act Data Governance Sprint — August 2026 Deadline


Sixty days from now, high-risk AI systems operating in the EU must be fully compliant with Art.10 of the EU AI Act. The deadline — August 2, 2026 — is not a filing date. It is the date from which non-compliant systems face market surveillance scrutiny, notified body audits, and penalties reaching €15 million or 3% of global annual turnover under Art.99.

This finale post consolidates everything from the four earlier posts in the sprint into a single, actionable compliance checklist. Use it to audit your current data governance posture, identify gaps, and assign remediation tasks before the deadline arrives.

The Sprint So Far:

Post 1: Art.10 foundations — what training data documentation the EU AI Act actually requires
Post 2: Dataset diversity and bias testing — how to audit training data for compliance
Post 3: Data provenance logging — tracking training data origin and governance records
Post 4: CI/CD data governance gates — automating compliance checks in your pipeline

What Art.10 Covers: The Full Scope

Art.10 of the EU AI Act governs data and data governance for high-risk AI systems. It applies to providers — organisations that develop or place high-risk AI systems on the EU market — and by extension to the training, validation, and testing datasets those systems depend on.

The article's requirements span six functional areas:

Area	Article Reference	Core Obligation
Data governance practices	Art.10(1)	Training, validation, and testing datasets must meet the quality criteria in Art.10(2)–(5)
Documentation requirements	Art.10(2)(a)–(h)	Eight specific practice areas covering design choices through gap analysis
Dataset quality standards	Art.10(3)	Relevant, sufficiently representative and, to the best extent possible, free of errors and complete
Deployment context	Art.10(4)	Datasets account for the geographical, contextual, behavioural, or functional setting of intended use
Bias monitoring with sensitive data	Art.10(5)	Special category data processing permitted only where strictly necessary for bias detection and correction
Systems without model training	Art.10(6)	Where no model training is involved, Art.10(2)–(5) apply to testing datasets only

The August 2, 2026 deadline applies to all high-risk AI systems under Annex III of the EU AI Act — including systems in biometric identification, critical infrastructure, education, employment, access to services, law enforcement, migration management, and administration of justice.

The Complete Compliance Checklist

Section 1: Data Governance Framework (Art.10(1))

Governance Structure

Designated data governance owner (role, not just a name) responsible for Art.10 compliance
Written data governance policy covering all training, validation, and testing datasets
Policy version-controlled and reviewed at least annually
Policy reviewed after any significant dataset update or model version change
Governance policy linked to your Annex IV technical documentation package

Dataset Registry

Central registry of all datasets used for training, validation, and testing
Each dataset entry includes: name, version, format, size, acquisition date, current location
Registry updated as part of your model release process (not retrospectively)
Registry accessible to internal auditors and, on request, to competent authorities
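As a minimal sketch of what one registry entry could look like, the dataclass below carries the fields listed above and serialises them to JSON so the registry can be committed alongside a model release. The `purpose` field, the dataset name, and the storage path are illustrative assumptions, not something the Act prescribes.

```python
from dataclasses import dataclass, asdict
from datetime import date
import json


@dataclass(frozen=True)
class DatasetEntry:
    """One registry row; fields mirror the checklist item above."""
    name: str
    version: str
    data_format: str
    size_bytes: int
    acquisition_date: date
    location: str
    purpose: str  # assumption: "training", "validation", or "testing"


def to_registry_json(entries):
    """Serialise entries deterministically so diffs between releases stay readable."""
    rows = [
        {**asdict(e), "acquisition_date": e.acquisition_date.isoformat()}
        for e in entries
    ]
    return json.dumps(rows, indent=2, sort_keys=True)


entry = DatasetEntry(
    name="loan-applications",                   # hypothetical dataset
    version="2.3.0",
    data_format="parquet",
    size_bytes=48_000_000,
    acquisition_date=date(2025, 11, 4),
    location="s3://example-bucket/loans/v2.3.0/",  # placeholder path
    purpose="training",
)
print(to_registry_json([entry]))
```

Keeping the output sorted and indented means a registry change shows up as a small, reviewable diff in the same commit as the release.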

Governance Process Integration

Data governance sign-off required before any model reaches staging environment
Data governance review is a blocking step in your CI/CD pipeline (see Post 4)
Incident response process includes data governance triggers (e.g., discovered bias, corrupted dataset)

Section 2: Documentation Requirements (Art.10(2)(a)–(h))

Art.10(2)(a) — Design Choices

Document why each dataset was selected over alternatives considered
Record the design rationale for your train/validation/test split proportions
Document feature engineering decisions and their data implications
Record decisions to exclude or downsample any population subgroups with justification

Art.10(2)(b) — Data Collection Processes and Origin

Full provenance chain for each dataset: origin → collection method → acquisition path → current location
For personal data: original collection purpose documented and verified as compatible with AI training use
Third-party datasets: license reviewed, data processing agreement in place if personal data involved
Web-scraped or crawled data: legal basis documented per GDPR Art.6 or applicable exemption
Synthetic data: generation methodology documented, validation against real-world distributions recorded
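One way to make the provenance items checkable is to list the fields each kind of dataset needs and report what is missing. The sketch below assumes a plain dictionary per dataset; every field name and the `kinds` tags are hypothetical and would need to match your own provenance records.

```python
# Fields every dataset needs, following the provenance chain above.
REQUIRED = ["origin", "collection_method", "acquisition_path", "current_location"]

# Extra fields by dataset kind (hypothetical names mirroring the checklist items).
EXTRA_BY_KIND = {
    "personal": ["original_purpose", "compatibility_assessment"],
    "third_party": ["license"],
    "scraped": ["legal_basis"],
    "synthetic": ["generation_method", "validation_report"],
}


def missing_provenance_fields(record):
    """Return the required fields that are absent or empty in a provenance record."""
    required = list(REQUIRED)
    for kind in record.get("kinds", []):
        required += EXTRA_BY_KIND.get(kind, [])
    return [field for field in required if not record.get(field)]


record = {
    "origin": "Vendor survey panel",          # illustrative values only
    "collection_method": "online survey",
    "acquisition_path": "SFTP transfer",
    "current_location": "s3://example-bucket/surveys/",
    "kinds": ["third_party", "personal"],
    "license": "commercial, ML training permitted",
}
print(missing_provenance_fields(record))
```

Here the record has a license but no documented original purpose or compatibility assessment, so those two fields are reported as missing.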

Art.10(2)(c) — Data Preparation Operations

All annotation, labelling, and tagging procedures documented with version history
Annotator qualification criteria and inter-annotator agreement scores recorded
Cleaning operations documented: what was removed, why, and how much of the dataset was affected
Enrichment procedures: what data was added, from what source, under what legal basis
Aggregation logic documented including join keys, deduplication rules, and merge methodology
Transformation pipelines stored as version-controlled code (not undocumented scripts)
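For the inter-annotator agreement item, Cohen's kappa is one commonly used score when two annotators label the same items. The sketch below is a plain-Python version for that two-annotator case. The article does not prescribe a metric, so treat the choice of kappa and the example labels as assumptions.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


annotator_1 = ["yes", "yes", "no", "no", "yes", "no"]
annotator_2 = ["yes", "no", "no", "no", "yes", "yes"]
print(round(cohens_kappa(annotator_1, annotator_2), 3))
```

Recording the score together with the labelling guideline version makes it possible to show later which instructions produced which level of agreement.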

Art.10(2)(d) — Assumptions

Written documentation of all significant assumptions made about the dataset and its representativeness
Assumptions about real-world distribution vs. dataset distribution explicitly stated
Temporal assumptions documented (e.g., "dataset reflects conditions as of Q3 2024")
Geographic scope assumptions recorded (e.g., which EU member states are represented)
Assumptions flagged when they materially affect the system's expected performance on subgroups

Art.10(2)(e) — Availability, Quantity, and Suitability Assessment

Formal suitability assessment completed for each dataset prior to model training
Minimum dataset size thresholds defined and met for each use case
Assessment covers whether dataset volume is sufficient for the system's intended deployment context
Reassessment triggered when deployment context changes or dataset ages significantly
Assessment document signed off by someone with authority over the training process

Art.10(2)(f)–(g) — Bias Examination and Mitigation

Bias assessment conducted across all protected characteristics relevant to your use case
Protected characteristics examined: age, disability, gender, race/ethnicity, religion, sexual orientation (minimum)
Bias metrics recorded: disparate impact ratios, false positive/negative rate differentials by group
Bias testing conducted on validation set AND on held-out test set separately
Bias remediation actions documented: what was changed, what effect was measured
Residual bias acknowledged where not fully mitigated, with documented risk acceptance decision
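The metrics named above can be computed from three aligned lists: true labels, predictions, and a group attribute. The following is a minimal sketch for binary labels (0 or 1). The function names and the toy data are hypothetical, and which ratio or differential counts as acceptable remains a decision for your own risk assessment.

```python
from collections import defaultdict


def group_rates(y_true, y_pred, groups):
    """Per-group selection rate, false positive rate, and false negative rate."""
    counts = defaultdict(lambda: {"n": 0, "pos_pred": 0, "pos": 0, "neg": 0, "fp": 0, "fn": 0})
    for truth, pred, group in zip(y_true, y_pred, groups):
        c = counts[group]
        c["n"] += 1
        c["pos_pred"] += pred
        if truth == 1:
            c["pos"] += 1
            c["fn"] += pred == 0
        else:
            c["neg"] += 1
            c["fp"] += pred == 1
    return {
        group: {
            "selection_rate": c["pos_pred"] / c["n"],
            "fpr": c["fp"] / c["neg"] if c["neg"] else None,
            "fnr": c["fn"] / c["pos"] if c["pos"] else None,
        }
        for group, c in counts.items()
    }


def disparate_impact_ratio(rates, protected, reference):
    """Selection rate of the protected group divided by that of the reference group."""
    reference_rate = rates[reference]["selection_rate"]
    if reference_rate == 0:
        raise ValueError("reference group has a zero selection rate")
    return rates[protected]["selection_rate"] / reference_rate


y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 1, 1, 0, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

rates = group_rates(y_true, y_pred, groups)
print(rates)
print(disparate_impact_ratio(rates, protected="B", reference="A"))
```

Running the same functions on the validation split and on the held-out test split, and storing both outputs, gives the separate records the checklist asks for.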

Art.10(2)(h) — Data Gaps and Shortcomings

Gap analysis completed: which populations, scenarios, or conditions are underrepresented?
Each identified gap assigned a severity classification (critical / significant / acceptable)
Mitigation strategy documented for critical and significant gaps
Known shortcomings included in your Art.13 transparency documentation to deployers
Gap register reviewed and updated before each major model version release

Section 3: Dataset Quality Standards (Art.10(3))

Relevance

Documented rationale for why each dataset is relevant to your specific AI system task
Relevance confirmed against the system's defined purpose in your Art.11 technical documentation
Relevance re-verified when the system's intended use expands or changes

Representativeness

Demographic distribution analysis for all identity-relevant datasets
Geographic coverage documented relative to intended deployment geography
Temporal coverage documented — no dataset used that predates the deployment context by more than your defined threshold
Representation gaps documented as Art.10(2)(h) shortcomings (see above)

Freedom from Errors

Error detection process applied to all datasets prior to training
Error types documented: label errors, duplicate records, outliers, corrupted entries
Error rates recorded: percentage of dataset affected
Error rate thresholds defined — training proceeds only if error rate is below threshold
Systematic errors (not random) trigger investigation before training proceeds
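A recorded error rate needs a concrete definition. As one possible sketch, the function below counts exact duplicate records and labels outside an allowed set, then compares the combined rate with a maximum you define. The record layout, the label set, and the 5% limit in the example are assumptions for illustration.

```python
def error_report(records, allowed_labels, max_error_rate):
    """Count duplicate and invalid-label records and compare the rate to a maximum."""
    seen = set()
    duplicates = invalid_labels = flagged = 0
    for record in records:
        key = tuple(sorted(record.items()))  # assumes hashable field values
        is_duplicate = key in seen
        seen.add(key)
        is_invalid = record.get("label") not in allowed_labels
        duplicates += is_duplicate
        invalid_labels += is_invalid
        flagged += is_duplicate or is_invalid  # a record counts once even if both apply
    rate = flagged / len(records) if records else 0.0
    return {
        "duplicates": duplicates,
        "invalid_labels": invalid_labels,
        "error_rate": rate,
        "passes": rate <= max_error_rate,
    }


records = [
    {"id": 1, "text": "a", "label": "approve"},
    {"id": 2, "text": "b", "label": "reject"},
    {"id": 2, "text": "b", "label": "reject"},   # exact duplicate
    {"id": 3, "text": "c", "label": "aprove"},   # misspelled label
]
report = error_report(records, allowed_labels={"approve", "reject"}, max_error_rate=0.05)
print(report)
```

Storing the report before and after cleaning gives you the before-and-after error rates an auditor is likely to ask for.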

Completeness

Missing value analysis completed for all features used in training
Missing value handling strategy documented: imputation method, exclusion, or flagging
High-missingness features (>5% missing) flagged for explicit review before training
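The 5% missingness flag is straightforward to automate. This sketch assumes rows are dictionaries and treats an absent key or a `None` value as missing; both the data layout and the example values are hypothetical.

```python
def high_missingness_features(rows, threshold=0.05):
    """Return {feature: missing_fraction} for features above the threshold."""
    if not rows:
        return {}
    features = set().union(*(row.keys() for row in rows))
    flagged = {}
    for feature in sorted(features):
        missing = sum(1 for row in rows if row.get(feature) is None)
        fraction = missing / len(rows)
        if fraction > threshold:
            flagged[feature] = fraction
    return flagged


rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 48000},
    {"age": 29, "income": 39000},
    {"age": 41, "income": 61000},
]
print(high_missingness_features(rows))
```

In this toy data, `age` is missing in one of four rows and is flagged for review, while `income` is complete and is not.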

Section 4: Sensitive Data for Bias Monitoring (Art.10(5))

Art.10(5) creates a narrow exception: to the extent strictly necessary for bias detection and correction in their high-risk AI system, providers may process special categories of personal data (GDPR Art.9).

If using Art.10(5): legal basis documented as this specific exception
Sensitive data processed for bias purposes is stored separately from training data
Access controls restrict sensitive bias-testing data to authorised personnel only
Data minimisation principle applied: only the minimum necessary sensitive data processed
Retention period defined and data deleted or anonymised after bias testing is complete
Art.10(5) processing documented in your GDPR Record of Processing Activities (RoPA)
DPIA completed if the bias testing involves large-scale processing of special category data

Section 5: CI/CD Integration Verification (Cross-Reference Post 4)

These items verify that your Art.10 compliance is automated and enforced at the pipeline level:

Gate 1 (Provenance): Pipeline blocks if any dataset lacks a complete provenance record
Gate 2 (Documentation): Pipeline blocks if Art.10(2)(a)–(h) documentation is incomplete
Gate 3 (Bias Testing): Pipeline blocks if bias assessment has not been completed for this dataset version
Gate 4 (Gap Registry): Pipeline blocks if new gaps have been identified but not classified
Gate 5 (Statistical Properties): Pipeline blocks if dataset fails defined quality thresholds
Compliance certificate auto-generated and committed to your audit log on each successful gate pass
Failed gate outputs are retained as evidence (not silently discarded)
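The five gates can be sketched as a list of named checks over a release manifest. Everything below is an assumption about how such a manifest might look: the keys, the gate names, and the record format are illustrative and are not taken from Post 4. The point is that every gate runs, every result is kept, and the overall pass is the conjunction of all of them.

```python
import json
from datetime import datetime, timezone

# Hypothetical manifest keys; adapt to however your pipeline stores this metadata.
GATES = [
    ("provenance", lambda m: all(d.get("provenance_complete") for d in m["datasets"])),
    ("documentation", lambda m: not m["missing_doc_sections"]),
    ("bias_testing", lambda m: all(d.get("bias_assessed_version") == d["version"]
                                   for d in m["datasets"])),
    ("gap_registry", lambda m: all(g.get("severity") for g in m["gaps"])),
    ("statistical_properties", lambda m: m["error_rate"] <= m["max_error_rate"]),
]


def run_gates(manifest):
    """Evaluate every gate without short-circuiting so all failures are recorded."""
    results = {name: bool(check(manifest)) for name, check in GATES}
    return {
        "checked_at": datetime.now(timezone.utc).isoformat(),
        "results": results,
        "passed": all(results.values()),
    }


manifest = {
    "datasets": [{
        "name": "loan-applications",
        "version": "2.3.0",
        "provenance_complete": True,
        "bias_assessed_version": "2.2.0",  # stale: assessed on an older version
    }],
    "missing_doc_sections": [],
    "gaps": [{"id": "GAP-1", "severity": "significant"}],
    "error_rate": 0.01,
    "max_error_rate": 0.02,
}
record = run_gates(manifest)
print(json.dumps(record, indent=2))
```

In a real pipeline the record would be written to the audit log whether or not it passed, and the job would exit non-zero when `passed` is false.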

Section 6: Audit Readiness

These items ensure you can respond to a competent authority inquiry within a reasonable timeframe:

All Art.10 documentation consolidated in a single retrievable location (not spread across team wikis)
Documentation timestamped and version-controlled — auditors will ask for the state at a specific model version
Named individual who can present Art.10 documentation to auditors on short notice
Evidence of Art.10 compliance included in or linked from your Annex IV technical documentation package
Internal audit of Art.10 compliance completed at least once before August 2, 2026
Remediation log maintained — if you found and fixed gaps, the audit trail shows you found them proactively

Scoring Your Current Posture

Score each section against your current state:

Section 1: Data Governance Framework     ___/12 items
Section 2: Documentation Requirements   ___/36 items
Section 3: Dataset Quality Standards    ___/15 items
Section 4: Sensitive Data (if applicable) ___/7 items
Section 5: CI/CD Integration            ___/7 items
Section 6: Audit Readiness              ___/6 items
                                        ___/83 total

Interpretation:

74–83 (89%+): Audit-ready. Minor gaps only. Assign owners and close them before July 15.
58–73 (70–88%): On track but needs acceleration. Identify your largest incomplete section and sprint it in the next 4 weeks.
42–57 (50–69%): Significant gaps. Prioritise Section 2 documentation and Section 5 CI/CD gates, which have the longest implementation lead times.
Below 42 (<50%): Critical state. Engage a compliance consultant alongside your engineering effort. You cannot close this gap by August 2 without a structured programme.
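If you track the checklist in a spreadsheet or issue tracker, the banding can be computed rather than read off by hand. The sketch below takes the checked and total item counts per section as input and applies the percentage bands from the interpretation above, rounding to the nearest whole percent. The function name and the example counts are hypothetical.

```python
def posture(section_scores):
    """section_scores maps section name -> (items_checked, items_total)."""
    checked = sum(done for done, _ in section_scores.values())
    total = sum(count for _, count in section_scores.values())
    percent = round(100 * checked / total)
    if percent >= 89:
        band = "Audit-ready"
    elif percent >= 70:
        band = "On track but needs acceleration"
    elif percent >= 50:
        band = "Significant gaps"
    else:
        band = "Critical state"
    return checked, total, percent, band


# Example with two sections only; pass all applicable sections in practice.
scores = {
    "Section 5: CI/CD Integration": (5, 7),
    "Section 6: Audit Readiness": (6, 6),
}
print(posture(scores))
```

Leaving Section 4 out of the input when it does not apply keeps the percentage comparable between teams that use the sensitive-data exception and teams that do not.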

The 10 Highest-Risk Gaps (What Auditors Look for First)

Based on the audit patterns that have emerged from early AI Act enforcement guidance and analogous GDPR enforcement decisions:

No provenance chain for third-party datasets — "We licensed it" is not sufficient without the collection method and original purpose documented
Bias testing on only one data split — The checklist above calls for bias examination on both the validation set and the held-out test set; results from the validation set alone do not cover this
Undocumented cleaning operations — If you dropped 15% of your dataset to remove outliers and didn't document why, that is a gap
Assumptions documented nowhere — Temporal and geographic assumptions are the most commonly missing
No gap register — Art.10(2)(h) requires you to identify gaps and how they will be addressed; asserting that none exist is not enough
Art.10(5) data mixed with training data — Sensitive data processed for bias detection must be kept strictly separate
Design choices not version-controlled — If you can't show what choices were made for a specific model version, you cannot demonstrate compliance at audit time
No designated governance owner — "The team is responsible" does not satisfy the governance framework obligation
Error rates not measured — "We cleaned the data" without recorded error rates before and after is not sufficient
Art.10 documentation not linked to Annex IV — Competent authorities review technical documentation first; if Art.10 evidence is in a separate system with no pointer, it may as well not exist

Implementation Timeline: June 3 to August 2, 2026

June 3–15  (13 days): Audit current state using this checklist. Score each section.
                      Assign owners to each gap. Triage: Critical / Important / Nice-to-have.

June 16–30 (15 days): Close Critical gaps. Priority order:
                       1. Governance owner designation
                       2. Dataset registry creation
                       3. Provenance documentation for all training datasets
                       4. Bias testing with recorded metrics

July 1–15  (15 days): Close Important gaps:
                       1. Art.10(2)(a)–(h) documentation for all datasets
                       2. CI/CD gate implementation
                       3. Error rate measurement and recording

July 16–25  (10 days): Internal compliance audit.
                        Simulate a competent authority documentation request.
                        Fix remaining gaps identified.

July 26–31   (6 days): Final review. Lock documentation versions.
                        Confirm technical documentation package is complete.

August 2, 2026: Deadline. System must be compliant from this date.

Quick-Reference: Key Article Numbers for Art.10

For your documentation and during auditor discussions:

Reference	What It Covers
Art.10(1)	General data governance obligations
Art.10(2)(a)	Design choice documentation
Art.10(2)(b)	Data collection processes and origin
Art.10(2)(c)	Data preparation operations
Art.10(2)(d)	Relevant assumptions
Art.10(2)(e)	Dataset availability, quantity, suitability
Art.10(2)(f)	Bias examination
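Art.10(2)(g)	Bias detection, prevention, and mitigation measures
Art.10(2)(h)	Data gaps and shortcomings
Art.10(3)	Dataset quality: relevance, representativeness, freedom from errors, completeness
Art.10(4)	Geographical, contextual, behavioural, or functional setting of intended use
Art.10(5)	Sensitive data exception for bias detection and correction
Art.10(6)	Systems without model training: Art.10(2)–(5) apply to testing datasets only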
Art.11	Technical documentation (links to Art.10 evidence)
Art.17	Quality management system (governance framework home)
Art.99(4)	Penalties: €15M or 3% global annual turnover

Closing the Sprint

This five-post sprint covered the complete Art.10 compliance lifecycle: what the article requires, bias testing methodology, provenance logging architecture, CI/CD gate implementation, and now this consolidated checklist.

The August 2 deadline is fixed. The obligations are specific. The checklist above gives you a concrete audit surface: 83 items that, when checked, represent a defensible Art.10 compliance posture.

The organisations that will face enforcement action in the first wave after August 2 are not primarily the ones that tried and fell slightly short — they are the ones that have no documentation at all, no governance owner, no record of ever having considered Art.10. If you have completed this sprint, you are already in a significantly better position than that.

This post is part of the sota.io EU AI Act Compliance Series. Related: Art.10 Data Governance Foundations — Dataset Diversity and Bias Testing — Provenance Logging — CI/CD Data Governance Gates

sota.io is EU-native managed PaaS — deploy compliant AI infrastructure on Hetzner Germany, no CLOUD Act exposure. Get started →
