EU AI Act Art.10 Training Data Governance: Developer Guide (Bias Examination, Data Gaps, Special Category Data)
Article 10 of the EU AI Act (Regulation 2024/1689) is the provision that most directly regulates what data you can use to train high-risk AI systems, how you must document it, and what you must do when your datasets are incomplete or potentially biased. It is one of the least-understood technical obligations in the Act — yet it will determine whether your conformity assessment succeeds or fails.
High-risk AI systems under Annex III (biometrics, critical infrastructure, education, employment, essential services, law enforcement, migration, justice) must comply with Art.10 by 2 August 2026. Systems embedded in regulated products (Annex I) face the 2027 deadline. The Art.10(5) special category data exception is already operational for providers that can justify its use.
This guide covers every sub-paragraph of Art.10, the critical intersection with GDPR's storage limitation principle, the EU Data Act's impact on IoT training data provenance, and how training pipeline architecture determines whether you operate in one legal regime or two.
Article 10 in the AI Act Framework
Art.10 sits within Chapter 2 of the AI Act (Obligations for Providers of High-Risk AI Systems), alongside:
| Article | Obligation |
|---|---|
| Art.9 | Risk management system (lifecycle-spanning) |
| Art.10 | Data governance and management practices |
| Art.11 | Technical documentation (Annex IV) |
| Art.12 | Record-keeping and logging |
| Art.13 | Transparency and provision of information |
| Art.14 | Human oversight |
| Art.15 | Accuracy, robustness, cybersecurity |
Art.10 applies to training, validation, and testing datasets. The obligations are identical for all three dataset types — not just training data. A system that passes training data governance but uses a biased validation set has a structural Art.10 failure.
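To illustrate that point, the same governance checks can be run uniformly across all three splits. This is a minimal sketch; the helper names and placeholder checks below are hypothetical, not terms from the Act:

```python
from typing import Callable, Dict, List

def run_split_checks(split: str, checks: List[Callable[[str], bool]]) -> Dict[str, bool]:
    """Apply every Art.10 governance check to one named dataset split."""
    return {check.__name__: check(split) for check in checks}

# Placeholder checks standing in for real governance logic (illustrative only).
def provenance_documented(split: str) -> bool:  # Art.10(2)(a)
    return True

def bias_examination_on_file(split: str) -> bool:  # Art.10(3)
    return split != "validation"  # e.g., the validation set was never examined

checks = [provenance_documented, bias_examination_on_file]
results = {s: run_split_checks(s, checks) for s in ("training", "validation", "test")}
failing_splits = [s for s, r in results.items() if not all(r.values())]
# → ["validation"]: one unexamined split is enough for a structural Art.10 failure
```

A compliance pipeline built this way cannot silently skip the validation or test set, which is exactly the failure mode described above.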
Art.10(1) — Scope: Training, Validation, and Testing Datasets
Art.10(1) establishes that providers must implement data governance and management practices covering training, validation, and testing datasets used for high-risk AI systems.
What counts as a "dataset" for Art.10:
- Datasets used to train model weights (training datasets)
- Datasets used to tune hyperparameters or select model architecture (validation datasets)
- Datasets used to evaluate final model performance (test datasets)
- Fine-tuning datasets applied to foundation models for specific high-risk applications
- Retrieval-Augmented Generation (RAG) corpora where the retrieval component is part of the high-risk AI output
What is excluded:
- GPAI model training datasets (governed separately by Art.53, not Art.10 — though downstream high-risk fine-tuning datasets do fall under Art.10)
- Inference-time user data that does not modify model weights
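As a rough sketch of these scope boundaries (the function and the role labels are illustrative assumptions, not terminology from the Act):

```python
def art10_in_scope(dataset_role: str, modifies_weights: bool = False) -> bool:
    """Rough Art.10(1) scope test for a dataset (illustrative, not legal advice).

    In scope: training/validation/test sets, fine-tuning sets for high-risk
    applications, and RAG corpora feeding a high-risk output.
    Out of scope: GPAI pre-training corpora (Art.53) and inference-time user
    data that never modifies model weights.
    """
    in_scope_roles = {"training", "validation", "test", "fine_tuning", "rag_corpus"}
    if dataset_role == "gpai_pretraining":
        return False  # governed by Art.53, not Art.10
    if dataset_role == "inference_input":
        return modifies_weights  # only in scope if it feeds back into training
    return dataset_role in in_scope_roles

art10_in_scope("fine_tuning")       # → True (downstream high-risk fine-tuning)
art10_in_scope("gpai_pretraining")  # → False (Art.53 regime)
```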
Art.10(2) — Data Governance Requirements: The Six Obligations
Art.10(2)(a)–(f) imposes six data governance requirements. All must be documented in your Art.11 technical documentation (Annex IV).
Art.10(2)(a) — Relevant Design Choices
Data governance must cover relevant design choices concerning the data, including:
- Data sources (origin, collection method, collection date)
- Data collection and labeling procedures
- Assumptions made about data collection context
- Data exclusions (what was excluded and why)
- Preprocessing steps (normalization, cleaning, transformation)
Practical implication: Every design decision that shaped what data entered your training pipeline must be documented. This is not a post-hoc exercise — the AI Act expects these decisions to be made consciously during data collection, not reconstructed afterward.
```python
from datetime import datetime
from dataclasses import dataclass, asdict
from typing import Optional, List


@dataclass
class DataDesignDecision:
    """
    Art.10(2)(a) — Document relevant design choices for training data.
    Must be included in Annex IV Technical Documentation.
    """
    decision_id: str
    decision_type: str  # "inclusion_criterion", "exclusion_criterion", "preprocessing", "labeling"
    description: str
    rationale: str
    date_decided: str
    decided_by: str  # Role, not personal name (GDPR compliance)
    impact_on_representativeness: Optional[str] = None


@dataclass
class DataSourceRecord:
    """Art.10(2)(a) — Document each data source."""
    source_id: str
    source_type: str  # "proprietary", "public", "licensed", "synthetic", "iot_device"
    collection_method: str  # "scraped", "surveyed", "sensor", "api", "purchased"
    collection_date_start: str
    collection_date_end: str
    geographic_scope: str  # "EU", "DE", "global", etc.
    demographic_scope: Optional[str] = None
    eu_data_act_applicable: bool = False  # True if IoT-generated (EU Data Act Art.4-5)
    gdpr_legal_basis: Optional[str] = None  # Required if personal data included


class Art10DataGovernanceLog:
    """
    Tracks Art.10(2)(a)-(f) compliance for a high-risk AI training pipeline.
    Output feeds into Art.11 Annex IV Technical Documentation.
    """

    def __init__(self, system_name: str, annex_iii_category: str):
        self.system_name = system_name
        self.annex_iii_category = annex_iii_category
        self.data_sources: List[DataSourceRecord] = []
        self.design_decisions: List[DataDesignDecision] = []
        self.created_at = datetime.utcnow().isoformat()

    def add_data_source(self, source: DataSourceRecord):
        self.data_sources.append(source)

    def add_design_decision(self, decision: DataDesignDecision):
        self.design_decisions.append(decision)

    def export_annex_iv_section(self) -> dict:
        """Export Art.10(2)(a) compliance data for Annex IV documentation."""
        return {
            "article": "10(2)(a)",
            "system": self.system_name,
            "annex_iii_category": self.annex_iii_category,
            "data_sources": [asdict(s) for s in self.data_sources],
            "design_decisions": [asdict(d) for d in self.design_decisions],
            "generated_at": datetime.utcnow().isoformat(),
        }
```
Art.10(2)(b) — Relevance and Representativeness
Training, validation, and testing datasets must be:
- Relevant to the intended purpose of the AI system
- Representative of the population or context in which the system will operate
- Free of errors to the extent possible
- Complete with respect to the characteristics needed for the intended purpose
The representativeness requirement is the most technically demanding. For an Annex III Category 4 employment AI system that screens CVs in Germany, your training dataset must be representative of:
- The demographic distribution of workers eligible for the role in Germany
- The distribution of qualifications, languages, and work history formats in the German labor market
- The temporal distribution (hiring patterns from periods relevant to current conditions)
A training dataset composed exclusively of historical successful hires is structurally unrepresentative: it excludes the "counterfactual negative" population (qualified candidates who were not hired for reasons unrelated to merit). This is both an Art.10(2)(b) violation and a direct path to Art.10(3) discriminatory bias.
```python
from typing import Dict
from scipy import stats


class RepresentativenessAnalyzer:
    """
    Art.10(2)(b) — Assess whether training datasets are sufficiently representative.
    Statistical testing approach for EU AI Act compliance.
    """

    def __init__(self, reference_population: Dict[str, float], alpha: float = 0.05):
        """
        reference_population: Expected distribution in target deployment context.
            E.g., {"age_18_30": 0.35, "age_31_50": 0.45, "age_51_plus": 0.20}
        alpha: Significance level for chi-square test (default 0.05 = 95% confidence).
        """
        self.reference = reference_population
        self.alpha = alpha

    def check_demographic_representativeness(
        self,
        dataset_distribution: Dict[str, float],
        sample_size: int
    ) -> dict:
        """
        Chi-square goodness-of-fit test: Is dataset distribution consistent with reference?
        Returns PASS/FAIL with p-value for Annex IV documentation.
        """
        categories = list(self.reference.keys())
        expected_counts = [self.reference[c] * sample_size for c in categories]
        observed_counts = [dataset_distribution.get(c, 0) * sample_size for c in categories]
        chi2_stat, p_value = stats.chisquare(observed_counts, f_exp=expected_counts)
        return {
            "test": "chi-square goodness-of-fit",
            "article": "10(2)(b) representativeness",
            "chi2_statistic": round(chi2_stat, 4),
            "p_value": round(p_value, 4),
            "alpha": self.alpha,
            "result": "PASS" if p_value > self.alpha else "FAIL",
            "interpretation": (
                "Dataset distribution is not significantly different from reference population"
                if p_value > self.alpha
                else f"Dataset distribution significantly deviates from reference "
                     f"(p={p_value:.4f} < {self.alpha}) — Art.10(2)(b) violation risk"
            ),
            "underrepresented_groups": [
                c for c in categories
                if dataset_distribution.get(c, 0) < self.reference[c] * 0.7  # <70% of expected
            ],
        }

    def check_temporal_coverage(
        self,
        dataset_date_range: tuple,  # (start_year, end_year)
        deployment_context_year: int,
        max_staleness_years: int = 5
    ) -> dict:
        """Check whether training data is temporally relevant to deployment context."""
        _, end_year = dataset_date_range
        staleness = deployment_context_year - end_year
        return {
            "article": "10(2)(b) relevance",
            "training_data_end_year": end_year,
            "deployment_year": deployment_context_year,
            "staleness_years": staleness,
            "result": "PASS" if staleness <= max_staleness_years else "FAIL",
            "risk": (
                "Temporal distribution gap may cause performance degradation in current context"
                if staleness > max_staleness_years else None
            ),
        }
```
Art.10(2)(c) — Appropriate Statistical Properties
Datasets must have appropriate statistical properties for the system's intended purpose, including the persons or groups of persons on which the high-risk AI system is intended to be used.
This provision operationalizes the concept that statistical sufficiency is purpose-specific. An employment AI system used across 27 EU member states requires statistical properties appropriate for each national labor market — a single aggregate distribution is not sufficient.
Indicative minimum sample size guidance (informed by NIST AI RMF (AI 100-1) practice and emerging European standardization work; the AI Act itself sets no numeric thresholds):
- For binary classification in high-risk contexts: minimum 1,000 samples per demographic subgroup evaluated
- For minority group performance assessment: minimum 100 samples per group for statistical significance at α=0.05
- For intersectional analysis (e.g., gender × age × nationality): minimum 50 samples per intersection cell
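The indicative thresholds above can be checked mechanically before training. In this sketch, the constants mirror the guidance in this section and are assumptions, not statutory minima:

```python
from typing import Dict, List

# Indicative minimum counts from the guidance above (not statutory).
MIN_PER_SUBGROUP = 1000        # binary classification, per evaluated subgroup
MIN_PER_MINORITY_GROUP = 100   # minority-group performance assessment
MIN_PER_INTERSECTION = 50      # intersectional cells (e.g., gender x age x nationality)

def check_subgroup_counts(counts: Dict[str, int], minimum: int) -> List[str]:
    """Return the subgroups whose sample count falls below the minimum."""
    return [group for group, n in sorted(counts.items()) if n < minimum]

counts = {"age_18_30": 4200, "age_31_50": 5100, "age_51_plus": 730}
undersampled = check_subgroup_counts(counts, MIN_PER_SUBGROUP)
# → ["age_51_plus"]: flag for targeted data collection before training proceeds
```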
Art.10(2)(d)–(f) — Additional Requirements
Art.10(2)(d) — Suitability to Intended Purpose: Data must be suitable for the AI system's specific intended use. A biometric system trained on adult faces cannot be deployed for child identification without separate compliance assessment.
Art.10(2)(e) — Examination for Biases: Datasets must be examined for possible biases that are likely to affect the health and safety of persons or to lead to discrimination prohibited under Union law (Art.21 Charter, GDPR Art.9, anti-discrimination directives). This obligation is developed further in Art.10(3) — the most technically complex provision.
Art.10(2)(f) — Proportionate Data Collection Measures: Where personal data is included, appropriate data management practices consistent with GDPR must be applied. This creates the Art.10(2)(f) × GDPR Art.5(1)(e) retention conflict discussed below.
Art.10(3) — Bias Examination: The Technical Core
Art.10(3) requires providers to examine training, validation, and testing datasets for possible biases that could lead to:
- Discrimination against persons based on characteristics protected under Art.21 EU Charter (sex, race, colour, ethnic or social origin, genetic features, language, religion or belief, political opinion, disability, age, sexual orientation, nationality)
- Discrimination prohibited under sector-specific EU law (employment directives, equal treatment directives)
This is not optional. Art.10(3) is a mandatory obligation for all high-risk AI providers. The absence of bias examination documentation is itself an Art.10 violation: you do not need to find bias to fail compliance; failing to look for it is enough.
```python
from datetime import datetime
from dataclasses import dataclass, field
from typing import List, Tuple, Optional
import numpy as np


@dataclass
class BiasExaminationReport:
    """
    Art.10(3) — Bias examination report for Annex IV Technical Documentation.
    Must be completed for EVERY training, validation, and test dataset.
    """
    dataset_id: str
    examination_date: str
    examiner_role: str  # Not personal name — role (e.g., "Data Governance Lead")
    # Protected characteristics examined (Art.21 EU Charter)
    characteristics_examined: List[str] = field(default_factory=list)
    # Bias metrics used
    metrics_applied: List[str] = field(default_factory=list)
    # Findings per characteristic
    findings: List[dict] = field(default_factory=list)
    # Mitigation measures applied
    mitigations_applied: List[str] = field(default_factory=list)
    # Residual risk assessment
    residual_bias_risk: str = "NOT_ASSESSED"  # LOW | MEDIUM | HIGH | NOT_ASSESSED


class Art10BiasExamination:
    """
    Systematic bias examination framework for Art.10(3) compliance.
    Covers: disparate impact, equalized odds, calibration, representation parity.
    All findings must be documented in the Art.11 Technical Documentation (Annex IV).
    """

    EU_CHARTER_ART21_CHARACTERISTICS = [
        "sex", "race", "colour", "ethnic_or_social_origin", "genetic_features",
        "language", "religion_or_belief", "political_opinion", "disability",
        "age", "sexual_orientation", "nationality"
    ]

    def compute_disparate_impact(
        self,
        y_pred: np.ndarray,
        protected_attribute: np.ndarray,
        favorable_outcome: int = 1
    ) -> Tuple[Optional[float], str]:
        """
        Disparate Impact Ratio (DIR): P(Y=1|A=0) / P(Y=1|A=1),
        where A=1 is the privileged group.
        Threshold guidance: DIR < 0.8 indicates potential discrimination
        (mirrors the US EEOC four-fifths rule; EU law sets no codified numeric threshold).
        """
        groups = np.unique(protected_attribute)
        if len(groups) != 2:
            return None, "Cannot compute DIR for non-binary protected attribute"
        group_0_rate = np.mean(y_pred[protected_attribute == groups[0]] == favorable_outcome)
        group_1_rate = np.mean(y_pred[protected_attribute == groups[1]] == favorable_outcome)
        # Use the lower-rate group as numerator
        dir_ratio = min(group_0_rate, group_1_rate) / max(group_0_rate, group_1_rate)
        if dir_ratio < 0.8:
            status = f"POTENTIAL_DISCRIMINATION (DIR={dir_ratio:.3f} < 0.8 threshold)"
        elif dir_ratio < 0.9:
            status = f"BORDERLINE (DIR={dir_ratio:.3f}, monitor closely)"
        else:
            status = f"PASS (DIR={dir_ratio:.3f})"
        return dir_ratio, status

    def compute_equalized_odds(
        self,
        y_true: np.ndarray,
        y_pred: np.ndarray,
        protected_attribute: np.ndarray
    ) -> dict:
        """
        Equalized Odds: Equal TPR and FPR across protected groups.
        Critical for law enforcement (Art.10(3) — high-stakes decision AI).
        """
        groups = np.unique(protected_attribute)
        results = {}
        for group in groups:
            mask = protected_attribute == group
            true_pos = np.sum((y_true[mask] == 1) & (y_pred[mask] == 1))
            false_neg = np.sum((y_true[mask] == 1) & (y_pred[mask] == 0))
            true_neg = np.sum((y_true[mask] == 0) & (y_pred[mask] == 0))
            false_pos = np.sum((y_true[mask] == 0) & (y_pred[mask] == 1))
            tpr = true_pos / (true_pos + false_neg) if (true_pos + false_neg) > 0 else 0
            fpr = false_pos / (false_pos + true_neg) if (false_pos + true_neg) > 0 else 0
            results[str(group)] = {"tpr": round(tpr, 4), "fpr": round(fpr, 4)}
        # Check for equalized odds violation
        tpr_values = [r["tpr"] for r in results.values()]
        fpr_values = [r["fpr"] for r in results.values()]
        tpr_gap = max(tpr_values) - min(tpr_values)
        fpr_gap = max(fpr_values) - min(fpr_values)
        results["equalized_odds_assessment"] = {
            "tpr_gap": round(tpr_gap, 4),
            "fpr_gap": round(fpr_gap, 4),
            "result": "PASS" if tpr_gap < 0.05 and fpr_gap < 0.05 else "FAIL",
            "article": "10(3) bias examination — equalized odds"
        }
        return results

    def generate_examination_report(
        self,
        dataset_id: str,
        characteristics_with_data: List[str],
        characteristics_without_data: List[str]
    ) -> BiasExaminationReport:
        """
        Generate Art.10(3) examination report structure.
        Both characteristics WITH data (tested) and WITHOUT data
        (Art.10(4) gap documentation) must be documented.
        """
        all_eu_characteristics = set(self.EU_CHARTER_ART21_CHARACTERISTICS)
        examined = set(characteristics_with_data)
        gaps = set(characteristics_without_data)
        # Any characteristic not examined and not documented as gap = compliance risk
        undocumented = all_eu_characteristics - examined - gaps
        return BiasExaminationReport(
            dataset_id=dataset_id,
            examination_date=datetime.utcnow().isoformat(),
            examiner_role="Data Governance Team",
            characteristics_examined=characteristics_with_data,
            metrics_applied=["disparate_impact_ratio", "equalized_odds", "calibration"],
            findings=[
                {
                    "characteristic": c,
                    "data_available": True,
                    "metrics_computed": ["DIR", "equalized_odds"],
                    "result": "pending_computation"
                }
                for c in characteristics_with_data
            ] + [
                {
                    "characteristic": c,
                    "data_available": False,
                    "art10_4_gap_documented": True,
                    "mitigation": "proxy_variable_analysis"
                }
                for c in characteristics_without_data
            ],
            residual_bias_risk="NOT_ASSESSED" if undocumented else "PENDING_REVIEW"
        )
```
Art.10(4) — Data Gaps: Documenting What You Cannot Measure
Art.10(4) addresses a practical reality: for many protected characteristics (religion, sexual orientation, political opinion), direct data collection is either legally prohibited under GDPR or practically impossible. The provision requires that where such data is not available, providers must:
- Document that relevant data is not available in the technical documentation
- Identify potential biases that could arise despite the absence of direct data
- Implement mitigation measures that are reasonably practicable
Why this matters: The absence of protected characteristic data does not prevent bias. Proxy variables (zip code as a proxy for race, name as a proxy for nationality, job title as a proxy for age) can encode discriminatory correlations even when the protected attribute is never explicitly included. Art.10(4) requires documenting this risk.
```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataGapDocumentation:
    """
    Art.10(4) — Document missing protected characteristic data.
    Required even when data cannot be collected (GDPR Art.9 prohibition).
    """
    characteristic: str  # Protected characteristic with no direct data
    reason_unavailable: str  # "gdpr_art9_prohibition" | "practical_infeasibility" | "not_applicable"
    proxy_variables_identified: List[str]  # Variables that may encode the characteristic
    proxy_bias_risk: str  # How proxies could create discriminatory outcomes
    mitigation_measures: List[str]
    residual_risk: str  # LOW | MEDIUM | HIGH


class ProxyVariableAnalyzer:
    """
    Art.10(4) — Identify proxy variables that may encode protected characteristics.
    Common proxy correlations documented in EU anti-discrimination case law:
    - zip_code → ethnicity, social class (spatial segregation)
    - first_name → nationality, religion, ethnicity
    - employment_gap → parenthood/disability
    - educational_institution → social class, religion
    - language_proficiency → national origin
    """

    DOCUMENTED_PROXIES = {
        "race": ["zip_code", "postal_code", "neighborhood", "school_attended", "first_name"],
        "religion": ["first_name", "last_name", "dietary_preference", "country_of_origin"],
        "nationality": ["first_name", "last_name", "language_of_cv", "phone_country_code"],
        "age": ["graduation_year", "years_of_experience", "technology_stack_age"],
        "sex": ["first_name", "career_gap", "part_time_employment_history"],
        "disability": ["employment_gap", "special_needs_mentions", "accommodation_requests"],
        "sexual_orientation": ["relationship_status", "same_sex_partner_references"],
        "ethnic_or_social_origin": ["zip_code", "school_attended", "name_etymology"],
    }

    def identify_proxies_in_feature_set(self, feature_names: List[str]) -> dict:
        """
        Identify which features in your model may proxy for protected characteristics.
        Returns Art.10(4) gap documentation for all characteristics without direct data.
        """
        proxy_risks = {}
        for characteristic, known_proxies in self.DOCUMENTED_PROXIES.items():
            proxies_present = [
                f for f in feature_names if any(p in f.lower() for p in known_proxies)
            ]
            if proxies_present:
                proxy_risks[characteristic] = {
                    "characteristic": characteristic,
                    "direct_data_available": False,
                    "article_10_4_gap": True,
                    "proxy_features_detected": proxies_present,
                    "bias_risk": "HIGH" if len(proxies_present) > 2 else "MEDIUM",
                    "required_mitigation": [
                        "Proxy correlation analysis",
                        "Counterfactual fairness testing",
                        "Feature importance analysis for proxy exclusion decision"
                    ]
                }
        return proxy_risks
```
Art.10(5) — Special Category Data Exception for Debiasing
Art.10(5) is the most carefully scoped provision in Article 10. It permits processing of special category data (GDPR Art.9(1): racial or ethnic origin, political opinions, religious beliefs, biometric data, health data, sexual orientation) for the purposes of bias detection and correction in high-risk AI systems — but only under strict cumulative conditions:
- Strict necessity: Processing must be strictly necessary for bias detection and correction in relation to a specific high-risk AI system
- Appropriate safeguards: Technical and organizational measures must be in place for the rights and freedoms of data subjects
- Deletion or anonymization: The special category data must be deleted or anonymized once bias correction is complete (or retained only for the duration strictly necessary)
- Technical + organizational safeguards: Access controls, purpose limitation, and documentation must be documented
This is not a general exception for training data collection. Art.10(5) does not permit building a training dataset that includes special category data for enrichment. It permits temporary processing of such data specifically to audit and correct bias in an already-developed system.
```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Art10_5_ProcessingRecord:
    """
    Art.10(5) — Record of special category data processing for debiasing.
    This record must be retained as evidence of compliance.
    """
    processing_id: str
    high_risk_system_id: str
    annex_iii_category: str
    # Special category data processed (GDPR Art.9(1))
    special_categories_used: List[str]  # e.g., ["racial_ethnic_origin", "disability_status"]
    # Art.10(5) conditions met
    strict_necessity_justification: str
    bias_type_being_corrected: str
    bias_metric_before: dict
    bias_metric_after: dict
    # Safeguards
    access_controls_in_place: List[str]
    purpose_limitation_enforced: bool
    # Deletion
    deletion_date: Optional[str] = None  # Required when debiasing is complete
    anonymization_method: Optional[str] = None  # If retained anonymized


class Art10_5_ComplianceChecker:
    """
    Verify that Art.10(5) special category data processing meets all cumulative conditions.
    All four conditions must be met simultaneously — failure of any one = violation.
    """

    def verify_conditions(self, processing_record: Art10_5_ProcessingRecord) -> dict:
        results = {}
        # Condition 1: Strict necessity
        results["strict_necessity"] = {
            "met": bool(processing_record.strict_necessity_justification),
            "evidence": processing_record.strict_necessity_justification,
            "guidance": "Processing must not be achievable through less intrusive means"
        }
        # Condition 2: Appropriate safeguards
        required_safeguards = ["access_control", "audit_log", "purpose_limitation", "need_to_know"]
        safeguards_met = all(
            any(s.lower() in provided.lower() for provided in processing_record.access_controls_in_place)
            for s in required_safeguards
        )
        results["appropriate_safeguards"] = {
            "met": safeguards_met and processing_record.purpose_limitation_enforced,
            "safeguards_documented": processing_record.access_controls_in_place,
            "purpose_limitation": processing_record.purpose_limitation_enforced
        }
        # Condition 3: Deletion or anonymization after use
        results["deletion_or_anonymization"] = {
            "met": bool(processing_record.deletion_date or processing_record.anonymization_method),
            "deletion_date": processing_record.deletion_date,
            "anonymization_method": processing_record.anonymization_method,
            "guidance": "Data must be deleted or anonymized once debiasing is complete"
        }
        # Condition 4: Demonstrable bias improvement
        has_metrics = bool(processing_record.bias_metric_before and processing_record.bias_metric_after)
        results["demonstrated_improvement"] = {
            "met": has_metrics,
            "metrics_before": processing_record.bias_metric_before,
            "metrics_after": processing_record.bias_metric_after,
            "guidance": "Art.10(5) requires strict necessity — must demonstrate the processing achieved its objective"
        }
        # Compute failing conditions BEFORE adding the summary keys, so we only
        # iterate over condition dicts (the summary values are plain strings/lists).
        failing_conditions = [k for k, v in results.items() if not v["met"]]
        results["overall_compliance"] = "PASS" if not failing_conditions else "FAIL"
        results["failing_conditions"] = failing_conditions
        return results
```
The GDPR × Art.10 Data Retention Conflict
The most practically significant legal tension in Art.10 is its interaction with GDPR Art.5(1)(e) — the storage limitation principle.
GDPR Art.5(1)(e) requires that personal data be kept in a form that permits identification of data subjects for no longer than necessary for the purpose for which it was collected.
AI Act Art.10 + Art.11 requires that technical documentation of training data, including data characteristics and bias examination records, be retained for 10 years after the AI system is placed on the market or put into service (Art.18(1), with the content defined by Art.11 and Annex IV).
The conflict: If your training dataset includes personal data (names, faces, behavioral traces), GDPR requires deletion when the training purpose is complete. But the AI Act requires retaining documentation of those datasets for 10 years.
A workable resolution, consistent with GDPR data minimisation (note that the EDPB has not yet issued binding guidance on this interaction):
| What must be retained (10 years) | What must be deleted (post-training) |
|---|---|
| Dataset metadata (size, structure, provenance, statistical properties) | Individual records containing personal data |
| Bias examination reports (aggregate statistics only) | Raw personal data used in bias examination |
| Sampling methodology documentation | Special category data after Art.10(5) debiasing |
| Data source identification (anonymous or pseudonymous) | Personally identifiable training samples |
| Preprocessing pipeline code and parameters | Personally identifiable validation/test samples |
Practical implementation: Maintain a two-tier documentation system — a permanent Art.11 technical documentation package (metadata only, no personal data) and a transient training pipeline that processes personal data under GDPR purposes limitation and deletes after training completion.
```python
class RetentionPolicyManager:
    """
    Implement the GDPR Art.5(1)(e) × AI Act Art.10/11 dual retention policy.
    Separates permanent technical documentation (10yr) from personal data
    (delete post-training).
    """

    PERMANENT_DOCUMENTATION = {
        "dataset_schema": "10 years from market placement",
        "statistical_summary": "10 years from market placement",
        "bias_examination_aggregate_results": "10 years from market placement",
        "data_source_provenance_metadata": "10 years from market placement",
        "preprocessing_pipeline_code": "10 years from market placement",
        "design_decision_log": "10 years from market placement",
        "art10_5_processing_records": "10 years from market placement",
    }

    TRANSIENT_DATA = {
        "personal_data_training_records": "Delete after training completion",
        "personal_data_validation_records": "Delete after model selection",
        "personal_data_test_records": "Delete after performance evaluation",
        "special_category_data_debiasing": "Delete after Art.10(5) debiasing process",
        "raw_biometric_training_data": "Delete after feature extraction",
    }

    def classify_data_asset(self, asset_name: str, contains_personal_data: bool) -> dict:
        """Classify a data asset for retention under the dual GDPR/AI Act regime."""
        # Check whether the asset is permanent technical documentation
        for doc_type, retention in self.PERMANENT_DOCUMENTATION.items():
            if doc_type in asset_name.lower():
                return {
                    "asset": asset_name,
                    "retention_regime": "AI Act Art.11 — Technical Documentation",
                    "retention_period": retention,
                    "personal_data_permitted": False,  # Must contain no personal data
                    "action": "Anonymize if currently contains personal data, then retain 10yr"
                }
        # Personal data assets
        if contains_personal_data:
            return {
                "asset": asset_name,
                "retention_regime": "GDPR Art.5(1)(e) — Storage Limitation",
                "retention_period": "Delete after training/validation/testing purpose fulfilled",
                "action": "Schedule deletion after training pipeline completion",
                "gdpr_article": "5(1)(e)"
            }
        return {
            "asset": asset_name,
            "retention_regime": "No personal data — standard business retention",
            "action": "Apply organization's standard data lifecycle policy"
        }
```
EU Data Act × Art.10: IoT Training Data Provenance
The EU Data Act (Regulation 2023/2854, applicable since September 2025) creates a new compliance layer for AI systems trained on IoT-generated data.
EU Data Act Art.4-5 grants users of IoT devices the right to access and share data generated by those devices. Art.5 allows third-party access to that data when contractually authorized. If your high-risk AI training pipeline uses data from connected products (medical sensors, industrial IoT, smart home devices, connected vehicles), the EU Data Act creates provenance obligations that intersect with AI Act Art.10.
The intersection (key scenarios):
| Scenario | EU Data Act Obligation | AI Act Art.10 Obligation | Combined Risk |
|---|---|---|---|
| Training medical AI on patient-connected device data | Art.4 data sharing right (patient can revoke) | Art.10(2)(b) representativeness + Art.10(3) bias | Revocation after training = dataset representativeness change |
| Training employment AI on employee productivity sensor data | Art.5 third-party sharing requires explicit contract | Art.10(2)(a) design choices — collection method | Unauthorized collection = Art.10 violation + Data Act Art.5 breach |
| Training critical infrastructure AI on industrial IoT streams | Art.4-5 user rights for IoT-generated data | Art.10(2)(b)+(d) relevance + representativeness | Data Act B2G access (Art.9-15) may override your Art.10 dataset integrity |
```python
from dataclasses import dataclass


@dataclass
class IoTDataProvenanceRecord:
    """
    Document IoT training data provenance for dual EU Data Act + AI Act Art.10 compliance.
    Required when training data includes data from 'connected products' or 'related services'
    within the scope of EU Data Act Reg. 2023/2854 (applicable Sept 2025).
    """
    device_type: str  # "medical_sensor", "industrial_iot", "smart_home", "connected_vehicle"
    eu_data_act_applicable: bool  # True if device is a "connected product" under Art.2(5) Data Act
    # EU Data Act compliance
    data_access_basis: str  # "user_consent_art4", "third_party_contract_art5", "b2g_art9"
    user_revocation_mechanism: bool  # Users can revoke access (Art.4 Data Act)
    revocation_impact_on_training: str  # What happens to training data if a user revokes
    # AI Act Art.10 integration
    data_act_gdpr_intersection: bool  # True if IoT data also contains personal data
    art10_2a_collection_documented: bool  # Design choice: IoT collection method documented

    def check_compliance(self) -> dict:
        issues = []
        if self.eu_data_act_applicable and not self.user_revocation_mechanism:
            issues.append("EU Data Act Art.4: User revocation mechanism required for connected product data")
        if not self.art10_2a_collection_documented:
            issues.append("AI Act Art.10(2)(a): IoT data collection method must be documented as design choice")
        if self.data_act_gdpr_intersection and self.data_access_basis not in ["user_consent_art4", "third_party_contract_art5"]:
            issues.append("GDPR × EU Data Act: Personal IoT data requires dual lawful basis")
        return {
            "status": "COMPLIANT" if not issues else "ISSUES_FOUND",
            "issues": issues,
            "remediation_required": len(issues) > 0
        }
```
Annex IV Technical Documentation: Art.10 Requirements
Every Art.10 compliance action generates documentation that must appear in the Annex IV Technical Documentation package. Under Art.18, this documentation must be kept for 10 years after market placement and made available to national competent authorities on request.
Minimum Art.10 content for Annex IV:
| Section | Required Content | Art.10 Source |
|---|---|---|
| 2.1 Dataset description | Sources, collection method, size, preprocessing steps | 10(2)(a) |
| 2.2 Representativeness assessment | Statistical analysis of demographic coverage | 10(2)(b) |
| 2.3 Data quality measures | Error rate, completeness analysis, validation methodology | 10(2)(c)/(d) |
| 2.4 Bias examination report | Per-characteristic bias metrics, DIR, equalized odds | 10(3) |
| 2.5 Data gap documentation | Protected characteristics without direct data + proxy analysis | 10(4) |
| 2.6 Special category data processing | If Art.10(5) used: conditions met, deletion record | 10(5) |
| 2.7 Retention policy | GDPR × AI Act dual retention framework | 10(2)(f) × Art.11 |
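The seven sections above lend themselves to an automated pre-submission completeness check. A minimal sketch, with hypothetical section keys and function names (the AI Act prescribes the content, not this schema):

```python
# Hypothetical completeness check for the seven Art.10-derived Annex IV sections.
REQUIRED_SECTIONS = {
    "2.1": "Dataset description (Art.10(2)(a))",
    "2.2": "Representativeness assessment (Art.10(2)(b))",
    "2.3": "Data quality measures (Art.10(2)(c)/(d))",
    "2.4": "Bias examination report (Art.10(3))",
    "2.5": "Data gap documentation (Art.10(4))",
    "2.6": "Special category data processing (Art.10(5))",
    "2.7": "Retention policy (Art.10(2)(f) × Art.11)",
}


def missing_sections(package: dict) -> list:
    """Return the Annex IV sections that are absent or empty in a draft package."""
    return [
        f"{num}: {title}"
        for num, title in REQUIRED_SECTIONS.items()
        if not package.get(num, "").strip()
    ]


# A draft package with only two sections filled in (illustrative content):
draft = {
    "2.1": "Sources: clinical IoT sensors; collection method: ...",
    "2.4": "DIR per protected characteristic: ...",
}
print(missing_sections(draft))  # five sections still to write
```

Wiring such a check into CI for the documentation repository turns "is the Annex IV package complete?" from a pre-audit scramble into a failing build.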
What EU-Native Infrastructure Means for Art.10
Training pipelines on US-hosted infrastructure create a structural Art.10 complication that EU-native deployments avoid.
The CLOUD Act exposure: When training data — including special category data processed under Art.10(5), bias examination records, and data gap documentation — is stored on US-based infrastructure (AWS, GCP, Azure, or their EU regions governed by US parent companies), the US CLOUD Act (18 U.S.C. § 2713) allows US law enforcement to compel production of that data, including data held on EU servers, without MLAT proceedings and without notifying the EU data subject or the EU national authority.
Why this matters for Art.10:
- Art.10(5) special category data processed for debiasing is among the most sensitive data types in the GDPR taxonomy. Processing it on CLOUD Act-exposed infrastructure creates a risk that EU market surveillance authorities (MSAs) would not accept as meeting the "appropriate safeguards" condition.
- Art.10(3) bias examination records may contain aggregated statistics about racial or ethnic composition of training populations. On US infrastructure, these records are CLOUD Act-accessible.
- Art.18's 10-year retention of the Annex IV technical documentation on US infrastructure means 10 years of CLOUD Act exposure for your AI system's compliance documentation.
EU-native training pipeline: When training, validation, testing, and documentation all occur on EU-native infrastructure (German law, no US parent company, GDPR as sole applicable data law), Art.10(5) processing occurs in a single legal regime. The CLOUD Act parallel access risk is structurally absent — you get Art.10 compliance without the jurisdictional complexity.
Conformity Assessment and Art.10
Art.10 compliance is assessed as part of the full conformity assessment for high-risk AI systems:
- Self-assessment (most Annex III categories): The provider conducts and documents Art.10 compliance internally, retaining evidence in the Annex IV package
- Notified Body (specific categories): Biometric RBI systems and Annex I product AI require third-party assessment, which includes Art.10 review
National MSAs (Germany: BNetzA or BaFin, depending on sector; France: CNIL or ANSSI, depending on sector) may request the Annex IV documentation for spot-checks. The absence of a complete Art.10(3) bias examination report for every dataset, including datasets where no bias was found, is the most common reason for conformity assessment failure.
The documentation-first principle: Art.10 compliance is not primarily about eliminating bias — it is about demonstrating that you examined for it. A system with documented, measured, mitigated residual bias scores better in conformity assessment than a system with no documentation at all.
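Following the documentation-first principle, a bias examination entry should be emitted for every dataset and characteristic pair, including null findings. A hypothetical sketch (the field names and the 0.8 disparate-impact threshold, a common four-fifths rule of thumb, are illustrative assumptions, not a prescribed schema):

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class BiasExaminationEntry:
    """One Art.10(3) examination record per dataset × protected characteristic."""
    dataset_id: str
    characteristic: str    # EU Charter Art.21 ground, e.g. "sex", "ethnic origin"
    metric: str            # e.g. "disparate_impact_ratio" (higher = less disparate)
    value: float
    threshold: float       # minimum acceptable metric value
    examined_on: date
    mitigation: str = "none required"

    @property
    def finding(self) -> str:
        # A null finding is still a finding: it must be documented.
        return "no bias detected" if self.value >= self.threshold else "bias detected"


entry = BiasExaminationEntry(
    dataset_id="train-v3",
    characteristic="sex",
    metric="disparate_impact_ratio",
    value=0.94,
    threshold=0.8,  # four-fifths rule of thumb; pick and justify your own threshold
    examined_on=date(2026, 3, 1),
)
print(entry.finding)  # no bias detected, and still recorded in the Annex IV package
```

The point of the sketch is the shape of the record, not the metric: whatever fairness metrics you choose, the entry for a clean dataset looks the same as the entry for a biased one, minus the mitigation field.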
Checklist: Art.10 Implementation for High-Risk AI
Before August 2026:
- Dataset inventory completed (all training, validation, test datasets documented)
- Data source provenance documented per Art.10(2)(a) for all datasets
- Representativeness analysis completed per Art.10(2)(b) with statistical evidence
- Bias examination report generated per Art.10(3) for all EU Charter Art.21 characteristics
- Data gap documentation written per Art.10(4) for all characteristics without direct data
- Proxy variable analysis completed for characteristics without direct data
- If Art.10(5) used: all four conditions documented, deletion schedule confirmed
- GDPR × Art.10 dual retention policy implemented (permanent metadata, transient personal data)
- If IoT training data used: EU Data Act Art.4-5 compliance verified
- Annex IV technical documentation package includes all 7 Art.10 sections
- 10-year documentation retention system operational (no personal data in retained docs)
See Also
- EU AI Act Art.6 High-Risk AI Systems: Developer Guide — Classification routes, Annex III categories, and the full Art.9-15 obligation framework
- EU AI Act Art.9 Risk Management: Formal Verification Developer Guide — The lifecycle-spanning risk management system that Art.10 data governance feeds into
- EU AI Act Art.5 Prohibited Practices: Developer Guide — Six prohibited AI practices applicable since February 2025
- EU Data Act B2B Data Sharing & AI Training Developer Guide — EU Data Act obligations for IoT data that feeds into AI training pipelines
- EU AI Office & GPAI Model Regulation Developer Guide — Art.53 training data transparency for GPAI models (separate from Art.10 for high-risk AI fine-tuning)