EU AI Act Art.10 Training Data Governance: Developer Guide (Bias Examination, Data Gaps, Special Category Data)
Article 10 of the EU AI Act (Regulation 2024/1689) is the provision that most directly regulates what data you can use to train high-risk AI systems, how you must document it, and what you must do when your datasets are incomplete or potentially biased. It is one of the least-understood technical obligations in the Act — yet it will determine whether your conformity assessment succeeds or fails.
High-risk AI systems under Annex III (biometrics, critical infrastructure, education, employment, essential services, law enforcement, migration, justice) must comply with Art.10 by 2 August 2026. Systems embedded in regulated products (Annex I) face the 2027 deadline. The Art.10(5) special category data exception is already operational for providers that can justify its use.
This guide covers every sub-paragraph of Art.10, the critical intersection with GDPR's storage limitation principle, the EU Data Act's impact on IoT training data provenance, and how training pipeline architecture determines whether you operate in one legal regime or two.
Article 10 in the AI Act Framework
Art.10 sits within Chapter 2 of the AI Act (Obligations for Providers of High-Risk AI Systems), alongside:
| Article | Obligation |
|---|---|
| Art.9 | Risk management system (lifecycle-spanning) |
| Art.10 | Data governance and management practices |
| Art.11 | Technical documentation (Annex IV) |
| Art.12 | Record-keeping and logging |
| Art.13 | Transparency and provision of information |
| Art.14 | Human oversight |
| Art.15 | Accuracy, robustness, cybersecurity |
Art.10 applies to training, validation, and testing datasets. The obligations are identical for all three dataset types — not just training data. A system that passes training data governance but uses a biased validation set has a structural Art.10 failure.
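To illustrate that point, the same governance checks can be run uniformly across all three splits. This is a minimal sketch; the helper names and placeholder checks below are hypothetical, not terms from the Act:

```python
from typing import Callable, Dict, List

def run_split_checks(split: str, checks: List[Callable[[str], bool]]) -> Dict[str, bool]:
    """Apply every Art.10 governance check to one named dataset split."""
    return {check.__name__: check(split) for check in checks}

# Placeholder checks standing in for real governance logic (illustrative only).
def provenance_documented(split: str) -> bool:  # Art.10(2)(a)
    return True

def bias_examination_on_file(split: str) -> bool:  # Art.10(3)
    return split != "validation"  # e.g., the validation set was never examined

checks = [provenance_documented, bias_examination_on_file]
results = {s: run_split_checks(s, checks) for s in ("training", "validation", "test")}
failing_splits = [s for s, r in results.items() if not all(r.values())]
# → ["validation"]: one unexamined split is enough for a structural Art.10 failure
```

A compliance pipeline built this way cannot silently skip the validation or test set, which is exactly the failure mode described above.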
Art.10(1) — Scope: Training, Validation, and Testing Datasets
Art.10(1) establishes that providers must implement data governance and management practices covering training, validation, and testing datasets used for high-risk AI systems.
What counts as a "dataset" for Art.10:
- Datasets used to train model weights (training datasets)
- Datasets used to tune hyperparameters or select model architecture (validation datasets)
- Datasets used to evaluate final model performance (test datasets)
- Fine-tuning datasets applied to foundation models for specific high-risk applications
- Retrieval-Augmented Generation (RAG) corpora where the retrieval component is part of the high-risk AI output
What is excluded:
- GPAI model training datasets (governed separately by Art.53, not Art.10 — though downstream high-risk fine-tuning datasets do fall under Art.10)
- Inference-time user data that does not modify model weights
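As a rough sketch of these scope boundaries (the function and the role labels are illustrative assumptions, not terminology from the Act):

```python
def art10_in_scope(dataset_role: str, modifies_weights: bool = False) -> bool:
    """Rough Art.10(1) scope test for a dataset (illustrative, not legal advice).

    In scope: training/validation/test sets, fine-tuning sets for high-risk
    applications, and RAG corpora feeding a high-risk output.
    Out of scope: GPAI pre-training corpora (Art.53) and inference-time user
    data that never modifies model weights.
    """
    in_scope_roles = {"training", "validation", "test", "fine_tuning", "rag_corpus"}
    if dataset_role == "gpai_pretraining":
        return False  # governed by Art.53, not Art.10
    if dataset_role == "inference_input":
        return modifies_weights  # only in scope if it feeds back into training
    return dataset_role in in_scope_roles

art10_in_scope("fine_tuning")       # → True (downstream high-risk fine-tuning)
art10_in_scope("gpai_pretraining")  # → False (Art.53 regime)
```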
Art.10(2) — Data Governance Requirements: The Six Obligations
Art.10(2)(a)–(f) imposes six data governance requirements. All must be documented in your Art.11 technical documentation (Annex IV).
Art.10(2)(a) — Relevant Design Choices
Data governance must cover relevant design choices concerning the data, including:
- Data sources (origin, collection method, collection date)
- Data collection and labeling procedures
- Assumptions made about data collection context
- Data exclusions (what was excluded and why)
- Preprocessing steps (normalization, cleaning, transformation)
Practical implication: Every design decision that shaped what data entered your training pipeline must be documented. This is not a post-hoc exercise — the AI Act expects these decisions to be made consciously during data collection, not reconstructed afterward.
```python
from datetime import datetime
from dataclasses import dataclass, asdict
from typing import Optional, List


@dataclass
class DataDesignDecision:
    """
    Art.10(2)(a) — Document relevant design choices for training data.
    Must be included in Annex IV Technical Documentation.
    """
    decision_id: str
    decision_type: str  # "inclusion_criterion", "exclusion_criterion", "preprocessing", "labeling"
    description: str
    rationale: str
    date_decided: str
    decided_by: str  # Role, not personal name (GDPR compliance)
    impact_on_representativeness: Optional[str] = None


@dataclass
class DataSourceRecord:
    """Art.10(2)(a) — Document each data source."""
    source_id: str
    source_type: str  # "proprietary", "public", "licensed", "synthetic", "iot_device"
    collection_method: str  # "scraped", "surveyed", "sensor", "api", "purchased"
    collection_date_start: str
    collection_date_end: str
    geographic_scope: str  # "EU", "DE", "global", etc.
    demographic_scope: Optional[str] = None
    eu_data_act_applicable: bool = False  # True if IoT-generated (EU Data Act Art.4-5)
    gdpr_legal_basis: Optional[str] = None  # Required if personal data included


class Art10DataGovernanceLog:
    """
    Tracks Art.10(2)(a)-(f) compliance for a high-risk AI training pipeline.
    Output feeds into Art.11 Annex IV Technical Documentation.
    """

    def __init__(self, system_name: str, annex_iii_category: str):
        self.system_name = system_name
        self.annex_iii_category = annex_iii_category
        self.data_sources: List[DataSourceRecord] = []
        self.design_decisions: List[DataDesignDecision] = []
        self.created_at = datetime.utcnow().isoformat()

    def add_data_source(self, source: DataSourceRecord):
        self.data_sources.append(source)

    def add_design_decision(self, decision: DataDesignDecision):
        self.design_decisions.append(decision)

    def export_annex_iv_section(self) -> dict:
        """Export Art.10(2)(a) compliance data for Annex IV documentation."""
        return {
            "article": "10(2)(a)",
            "system": self.system_name,
            "annex_iii_category": self.annex_iii_category,
            "data_sources": [asdict(s) for s in self.data_sources],
            "design_decisions": [asdict(d) for d in self.design_decisions],
            "generated_at": datetime.utcnow().isoformat(),
        }
```
Art.10(2)(b) — Relevance and Representativeness
Training, validation, and testing datasets must be:
- Relevant to the intended purpose of the AI system
- Representative of the population or context in which the system will operate
- Free of errors to the extent possible
- Complete with respect to the characteristics needed for the intended purpose
The representativeness requirement is the most technically demanding. For an Annex III Category 4 employment AI system that screens CVs in Germany, your training dataset must be representative of:
- The demographic distribution of workers eligible for the role in Germany
- The distribution of qualifications, languages, and work history formats in the German labor market
- The temporal distribution (hiring patterns from periods relevant to current conditions)
A training dataset composed exclusively of historical successful hires is structurally unrepresentative: it excludes the "counterfactual negative" population (qualified candidates who were not hired for reasons unrelated to merit). This is both an Art.10(2)(b) violation and a direct path to Art.10(3) discriminatory bias.
```python
from typing import Dict
from scipy import stats


class RepresentativenessAnalyzer:
    """
    Art.10(2)(b) — Assess whether training datasets are sufficiently representative.
    Statistical testing approach for EU AI Act compliance.
    """

    def __init__(self, reference_population: Dict[str, float], alpha: float = 0.05):
        """
        reference_population: Expected distribution in target deployment context.
            E.g., {"age_18_30": 0.35, "age_31_50": 0.45, "age_51_plus": 0.20}
        alpha: Significance level for chi-square test (default 0.05 = 95% confidence).
        """
        self.reference = reference_population
        self.alpha = alpha

    def check_demographic_representativeness(
        self,
        dataset_distribution: Dict[str, float],
        sample_size: int
    ) -> dict:
        """
        Chi-square goodness-of-fit test: Is dataset distribution consistent with reference?
        Returns PASS/FAIL with p-value for Annex IV documentation.
        """
        categories = list(self.reference.keys())
        expected_counts = [self.reference[c] * sample_size for c in categories]
        observed_counts = [dataset_distribution.get(c, 0) * sample_size for c in categories]
        chi2_stat, p_value = stats.chisquare(observed_counts, f_exp=expected_counts)
        return {
            "test": "chi-square goodness-of-fit",
            "article": "10(2)(b) representativeness",
            "chi2_statistic": round(chi2_stat, 4),
            "p_value": round(p_value, 4),
            "alpha": self.alpha,
            "result": "PASS" if p_value > self.alpha else "FAIL",
            "interpretation": (
                "Dataset distribution is not significantly different from reference population"
                if p_value > self.alpha
                else f"Dataset distribution significantly deviates from reference "
                     f"(p={p_value:.4f} < {self.alpha}) — Art.10(2)(b) violation risk"
            ),
            "underrepresented_groups": [
                c for c in categories
                if dataset_distribution.get(c, 0) < self.reference[c] * 0.7  # <70% of expected
            ],
        }

    def check_temporal_coverage(
        self,
        dataset_date_range: tuple,  # (start_year, end_year)
        deployment_context_year: int,
        max_staleness_years: int = 5
    ) -> dict:
        """Check whether training data is temporally relevant to deployment context."""
        _, end_year = dataset_date_range
        staleness = deployment_context_year - end_year
        return {
            "article": "10(2)(b) relevance",
            "training_data_end_year": end_year,
            "deployment_year": deployment_context_year,
            "staleness_years": staleness,
            "result": "PASS" if staleness <= max_staleness_years else "FAIL",
            "risk": (
                "Temporal distribution gap may cause performance degradation in current context"
                if staleness > max_staleness_years else None
            ),
        }
```
Art.10(2)(c) — Appropriate Statistical Properties
Datasets must have appropriate statistical properties for the system's intended purpose, including the persons or groups of persons on which the high-risk AI system is intended to be used.
This provision operationalizes the concept that statistical sufficiency is purpose-specific. An employment AI system used across 27 EU member states requires statistical properties appropriate for each national labor market — a single aggregate distribution is not sufficient.
Indicative minimum sample size guidance (informed by NIST AI RMF (AI 100-1) practice and emerging European standardization work; the AI Act itself sets no numeric thresholds):
- For binary classification in high-risk contexts: minimum 1,000 samples per demographic subgroup evaluated
- For minority group performance assessment: minimum 100 samples per group for statistical significance at α=0.05
- For intersectional analysis (e.g., gender × age × nationality): minimum 50 samples per intersection cell
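The indicative thresholds above can be checked mechanically before training. In this sketch, the constants mirror the guidance in this section and are assumptions, not statutory minima:

```python
from typing import Dict, List

# Indicative minimum counts from the guidance above (not statutory).
MIN_PER_SUBGROUP = 1000        # binary classification, per evaluated subgroup
MIN_PER_MINORITY_GROUP = 100   # minority-group performance assessment
MIN_PER_INTERSECTION = 50      # intersectional cells (e.g., gender x age x nationality)

def check_subgroup_counts(counts: Dict[str, int], minimum: int) -> List[str]:
    """Return the subgroups whose sample count falls below the minimum."""
    return [group for group, n in sorted(counts.items()) if n < minimum]

counts = {"age_18_30": 4200, "age_31_50": 5100, "age_51_plus": 730}
undersampled = check_subgroup_counts(counts, MIN_PER_SUBGROUP)
# → ["age_51_plus"]: flag for targeted data collection before training proceeds
```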
Art.10(2)(d)–(f) — Additional Requirements
Art.10(2)(d) — Suitability to Intended Purpose: Data must be suitable for the AI system's specific intended use. A biometric system trained on adult faces cannot be deployed for child identification without separate compliance assessment.
Art.10(2)(e) — Examination for Biases: Datasets must be examined for possible biases that are likely to affect the health and safety of persons or to lead to discrimination prohibited under Union law (Art.21 Charter, GDPR Art.9, anti-discrimination directives). This obligation is developed further in Art.10(3) — the most technically complex provision.
Art.10(2)(f) — Proportionate Data Collection Measures: Where personal data is included, appropriate data management practices consistent with GDPR must be applied. This creates the Art.10(2)(f) × GDPR Art.5(1)(e) retention conflict discussed below.
Art.10(3) — Bias Examination: The Technical Core
Art.10(3) requires providers to examine training, validation, and testing datasets for possible biases that could lead to:
- Discrimination against persons based on characteristics protected under Art.21 EU Charter (sex, race, colour, ethnic or social origin, genetic features, language, religion or belief, political opinion, disability, age, sexual orientation, nationality)
- Discrimination prohibited under sector-specific EU law (employment directives, equal treatment directives)
This is not optional. Art.10(3) is a mandatory obligation for all high-risk AI providers. The absence of bias examination documentation is itself an Art.10 violation: you do not need to find bias to fail compliance; failing to look for it is enough.
```python
from datetime import datetime
from dataclasses import dataclass, field
from typing import List, Tuple, Optional
import numpy as np


@dataclass
class BiasExaminationReport:
    """
    Art.10(3) — Bias examination report for Annex IV Technical Documentation.
    Must be completed for EVERY training, validation, and test dataset.
    """
    dataset_id: str
    examination_date: str
    examiner_role: str  # Not personal name — role (e.g., "Data Governance Lead")
    # Protected characteristics examined (Art.21 EU Charter)
    characteristics_examined: List[str] = field(default_factory=list)
    # Bias metrics used
    metrics_applied: List[str] = field(default_factory=list)
    # Findings per characteristic
    findings: List[dict] = field(default_factory=list)
    # Mitigation measures applied
    mitigations_applied: List[str] = field(default_factory=list)
    # Residual risk assessment
    residual_bias_risk: str = "NOT_ASSESSED"  # LOW | MEDIUM | HIGH | NOT_ASSESSED


class Art10BiasExamination:
    """
    Systematic bias examination framework for Art.10(3) compliance.
    Covers: disparate impact, equalized odds, calibration, representation parity.
    All findings must be documented in the Art.11 Technical Documentation (Annex IV).
    """

    EU_CHARTER_ART21_CHARACTERISTICS = [
        "sex", "race", "colour", "ethnic_or_social_origin", "genetic_features",
        "language", "religion_or_belief", "political_opinion", "disability",
        "age", "sexual_orientation", "nationality"
    ]

    def compute_disparate_impact(
        self,
        y_pred: np.ndarray,
        protected_attribute: np.ndarray,
        favorable_outcome: int = 1
    ) -> Tuple[Optional[float], str]:
        """
        Disparate Impact Ratio (DIR): P(Y=1|A=0) / P(Y=1|A=1),
        where A=1 is the privileged group.
        Threshold guidance: DIR < 0.8 indicates potential discrimination
        (mirrors the US EEOC four-fifths rule; EU law sets no codified numeric threshold).
        """
        groups = np.unique(protected_attribute)
        if len(groups) != 2:
            return None, "Cannot compute DIR for non-binary protected attribute"
        group_0_rate = np.mean(y_pred[protected_attribute == groups[0]] == favorable_outcome)
        group_1_rate = np.mean(y_pred[protected_attribute == groups[1]] == favorable_outcome)
        # Use the lower-rate group as numerator
        dir_ratio = min(group_0_rate, group_1_rate) / max(group_0_rate, group_1_rate)
        if dir_ratio < 0.8:
            status = f"POTENTIAL_DISCRIMINATION (DIR={dir_ratio:.3f} < 0.8 threshold)"
        elif dir_ratio < 0.9:
            status = f"BORDERLINE (DIR={dir_ratio:.3f}, monitor closely)"
        else:
            status = f"PASS (DIR={dir_ratio:.3f})"
        return dir_ratio, status

    def compute_equalized_odds(
        self,
        y_true: np.ndarray,
        y_pred: np.ndarray,
        protected_attribute: np.ndarray
    ) -> dict:
        """
        Equalized Odds: Equal TPR and FPR across protected groups.
        Critical for law enforcement (Art.10(3) — high-stakes decision AI).
        """
        groups = np.unique(protected_attribute)
        results = {}
        for group in groups:
            mask = protected_attribute == group
            true_pos = np.sum((y_true[mask] == 1) & (y_pred[mask] == 1))
            false_neg = np.sum((y_true[mask] == 1) & (y_pred[mask] == 0))
            true_neg = np.sum((y_true[mask] == 0) & (y_pred[mask] == 0))
            false_pos = np.sum((y_true[mask] == 0) & (y_pred[mask] == 1))
            tpr = true_pos / (true_pos + false_neg) if (true_pos + false_neg) > 0 else 0
            fpr = false_pos / (false_pos + true_neg) if (false_pos + true_neg) > 0 else 0
            results[str(group)] = {"tpr": round(tpr, 4), "fpr": round(fpr, 4)}
        # Check for equalized odds violation
        tpr_values = [r["tpr"] for r in results.values()]
        fpr_values = [r["fpr"] for r in results.values()]
        tpr_gap = max(tpr_values) - min(tpr_values)
        fpr_gap = max(fpr_values) - min(fpr_values)
        results["equalized_odds_assessment"] = {
            "tpr_gap": round(tpr_gap, 4),
            "fpr_gap": round(fpr_gap, 4),
            "result": "PASS" if tpr_gap < 0.05 and fpr_gap < 0.05 else "FAIL",
            "article": "10(3) bias examination — equalized odds"
        }
        return results

    def generate_examination_report(
        self,
        dataset_id: str,
        characteristics_with_data: List[str],
        characteristics_without_data: List[str]
    ) -> BiasExaminationReport:
        """
        Generate Art.10(3) examination report structure.
        Both characteristics WITH data (tested) and WITHOUT data
        (Art.10(4) gap documentation) must be documented.
        """
        all_eu_characteristics = set(self.EU_CHARTER_ART21_CHARACTERISTICS)
        examined = set(characteristics_with_data)
        gaps = set(characteristics_without_data)
        # Any characteristic not examined and not documented as gap = compliance risk
        undocumented = all_eu_characteristics - examined - gaps
        return BiasExaminationReport(
            dataset_id=dataset_id,
            examination_date=datetime.utcnow().isoformat(),
            examiner_role="Data Governance Team",
            characteristics_examined=characteristics_with_data,
            metrics_applied=["disparate_impact_ratio", "equalized_odds", "calibration"],
            findings=[
                {
                    "characteristic": c,
                    "data_available": True,
                    "metrics_computed": ["DIR", "equalized_odds"],
                    "result": "pending_computation"
                }
                for c in characteristics_with_data
            ] + [
                {
                    "characteristic": c,
                    "data_available": False,
                    "art10_4_gap_documented": True,
                    "mitigation": "proxy_variable_analysis"
                }
                for c in characteristics_without_data
            ],
            residual_bias_risk="NOT_ASSESSED" if undocumented else "PENDING_REVIEW"
        )
```
Art.10(4) — Data Gaps: Documenting What You Cannot Measure
Art.10(4) addresses a practical reality: for many protected characteristics (religion, sexual orientation, political opinion), direct data collection is either legally prohibited under GDPR or practically impossible. The provision requires that where such data is not available, providers must:
- Document that relevant data is not available in the technical documentation
- Identify potential biases that could arise despite the absence of direct data
- Implement mitigation measures that are reasonably practicable
Why this matters: The absence of protected characteristic data does not prevent bias. Proxy variables (zip code as a proxy for race, name as a proxy for nationality, job title as a proxy for age) can encode discriminatory correlations even when the protected attribute is never explicitly included. Art.10(4) requires documenting this risk.
```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataGapDocumentation:
    """
    Art.10(4) — Document missing protected characteristic data.
    Required even when data cannot be collected (GDPR Art.9 prohibition).
    """
    characteristic: str  # Protected characteristic with no direct data
    reason_unavailable: str  # "gdpr_art9_prohibition" | "practical_infeasibility" | "not_applicable"
    proxy_variables_identified: List[str]  # Variables that may encode the characteristic
    proxy_bias_risk: str  # How proxies could create discriminatory outcomes
    mitigation_measures: List[str]
    residual_risk: str  # LOW | MEDIUM | HIGH


class ProxyVariableAnalyzer:
    """
    Art.10(4) — Identify proxy variables that may encode protected characteristics.
    Common proxy correlations documented in EU anti-discrimination case law:
    - zip_code → ethnicity, social class (spatial segregation)
    - first_name → nationality, religion, ethnicity
    - employment_gap → parenthood/disability
    - educational_institution → social class, religion
    - language_proficiency → national origin
    """

    DOCUMENTED_PROXIES = {
        "race": ["zip_code", "postal_code", "neighborhood", "school_attended", "first_name"],
        "religion": ["first_name", "last_name", "dietary_preference", "country_of_origin"],
        "nationality": ["first_name", "last_name", "language_of_cv", "phone_country_code"],
        "age": ["graduation_year", "years_of_experience", "technology_stack_age"],
        "sex": ["first_name", "career_gap", "part_time_employment_history"],
        "disability": ["employment_gap", "special_needs_mentions", "accommodation_requests"],
        "sexual_orientation": ["relationship_status", "same_sex_partner_references"],
        "ethnic_or_social_origin": ["zip_code", "school_attended", "name_etymology"],
    }

    def identify_proxies_in_feature_set(self, feature_names: List[str]) -> dict:
        """
        Identify which features in your model may proxy for protected characteristics.
        Returns Art.10(4) gap documentation for all characteristics without direct data.
        """
        proxy_risks = {}
        for characteristic, known_proxies in self.DOCUMENTED_PROXIES.items():
            proxies_present = [
                f for f in feature_names if any(p in f.lower() for p in known_proxies)
            ]
            if proxies_present:
                proxy_risks[characteristic] = {
                    "characteristic": characteristic,
                    "direct_data_available": False,
                    "article_10_4_gap": True,
                    "proxy_features_detected": proxies_present,
                    "bias_risk": "HIGH" if len(proxies_present) > 2 else "MEDIUM",
                    "required_mitigation": [
                        "Proxy correlation analysis",
                        "Counterfactual fairness testing",
                        "Feature importance analysis for proxy exclusion decision"
                    ]
                }
        return proxy_risks
```
Art.10(5) — Special Category Data Exception for Debiasing
Art.10(5) is the most carefully scoped provision in Article 10. It permits processing of special category data (GDPR Art.9(1): racial or ethnic origin, political opinions, religious beliefs, biometric data, health data, sexual orientation) for the purposes of bias detection and correction in high-risk AI systems — but only under strict cumulative conditions:
- Strict necessity: Processing must be strictly necessary for bias detection and correction in relation to a specific high-risk AI system
- Appropriate safeguards: Technical and organizational measures must be in place for the rights and freedoms of data subjects
- Deletion or anonymization: The special category data must be deleted or anonymized once bias correction is complete (or retained only for the duration strictly necessary)
- Technical + organizational safeguards: Access controls, purpose limitation, and documentation must be documented
This is not a general exception for training data collection. Art.10(5) does not permit building a training dataset that includes special category data for enrichment. It permits temporary processing of such data specifically to audit and correct bias in an already-developed system.
```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Art10_5_ProcessingRecord:
    """
    Art.10(5) — Record of special category data processing for debiasing.
    This record must be retained as evidence of compliance.
    """
    processing_id: str
    high_risk_system_id: str
    annex_iii_category: str
    # Special category data processed (GDPR Art.9(1))
    special_categories_used: List[str]  # e.g., ["racial_ethnic_origin", "disability_status"]
    # Art.10(5) conditions met
    strict_necessity_justification: str
    bias_type_being_corrected: str
    bias_metric_before: dict
    bias_metric_after: dict
    # Safeguards
    access_controls_in_place: List[str]
    purpose_limitation_enforced: bool
    # Deletion
    deletion_date: Optional[str] = None  # Required when debiasing is complete
    anonymization_method: Optional[str] = None  # If retained anonymized


class Art10_5_ComplianceChecker:
    """
    Verify that Art.10(5) special category data processing meets all cumulative conditions.
    All four conditions must be met simultaneously — failure of any one = violation.
    """

    def verify_conditions(self, processing_record: Art10_5_ProcessingRecord) -> dict:
        results = {}
        # Condition 1: Strict necessity
        results["strict_necessity"] = {
            "met": bool(processing_record.strict_necessity_justification),
            "evidence": processing_record.strict_necessity_justification,
            "guidance": "Processing must not be achievable through less intrusive means"
        }
        # Condition 2: Appropriate safeguards
        required_safeguards = ["access_control", "audit_log", "purpose_limitation", "need_to_know"]
        safeguards_met = all(
            any(s.lower() in provided.lower() for provided in processing_record.access_controls_in_place)
            for s in required_safeguards
        )
        results["appropriate_safeguards"] = {
            "met": safeguards_met and processing_record.purpose_limitation_enforced,
            "safeguards_documented": processing_record.access_controls_in_place,
            "purpose_limitation": processing_record.purpose_limitation_enforced
        }
        # Condition 3: Deletion or anonymization after use
        results["deletion_or_anonymization"] = {
            "met": bool(processing_record.deletion_date or processing_record.anonymization_method),
            "deletion_date": processing_record.deletion_date,
            "anonymization_method": processing_record.anonymization_method,
            "guidance": "Data must be deleted or anonymized once debiasing is complete"
        }
        # Condition 4: Demonstrable bias improvement
        has_metrics = bool(processing_record.bias_metric_before and processing_record.bias_metric_after)
        results["demonstrated_improvement"] = {
            "met": has_metrics,
            "metrics_before": processing_record.bias_metric_before,
            "metrics_after": processing_record.bias_metric_after,
            "guidance": "Art.10(5) requires strict necessity — must demonstrate the processing achieved its objective"
        }
        # Compute failing conditions BEFORE adding the summary keys, so we only
        # iterate over condition dicts (the summary values are plain strings/lists).
        failing_conditions = [k for k, v in results.items() if not v["met"]]
        results["overall_compliance"] = "PASS" if not failing_conditions else "FAIL"
        results["failing_conditions"] = failing_conditions
        return results
```
The GDPR × Art.10 Data Retention Conflict
The most practically significant legal tension in Art.10 is its interaction with GDPR Art.5(1)(e) — the storage limitation principle.
GDPR Art.5(1)(e) requires that personal data be kept in a form that permits identification of data subjects for no longer than necessary for the purpose for which it was collected.
AI Act Art.10 + Art.11 requires that technical documentation of training data, including data characteristics and bias examination records, be retained for 10 years after the AI system is placed on the market or put into service (Art.18(1), with the content defined by Art.11 and Annex IV).
The conflict: If your training dataset includes personal data (names, faces, behavioral traces), GDPR requires deletion when the training purpose is complete. But the AI Act requires retaining documentation of those datasets for 10 years.
A workable resolution, consistent with GDPR data minimisation (note that the EDPB has not yet issued binding guidance on this interaction):
| What must be retained (10 years) | What must be deleted (post-training) |
|---|---|
| Dataset metadata (size, structure, provenance, statistical properties) | Individual records containing personal data |
| Bias examination reports (aggregate statistics only) | Raw personal data used in bias examination |
| Sampling methodology documentation | Special category data after Art.10(5) debiasing |
| Data source identification (anonymous or pseudonymous) | Personally identifiable training samples |
| Preprocessing pipeline code and parameters | Personally identifiable validation/test samples |
Practical implementation: Maintain a two-tier documentation system — a permanent Art.11 technical documentation package (metadata only, no personal data) and a transient training pipeline that processes personal data under GDPR purposes limitation and deletes after training completion.
```python
class RetentionPolicyManager:
    """
    Implement the GDPR Art.5(1)(e) × AI Act Art.10/11 dual retention policy.
    Separates permanent technical documentation (10yr) from personal data
    (delete post-training).
    """

    PERMANENT_DOCUMENTATION = {
        "dataset_schema": "10 years from market placement",
        "statistical_summary": "10 years from market placement",
        "bias_examination_aggregate_results": "10 years from market placement",
        "data_source_provenance_metadata": "10 years from market placement",
        "preprocessing_pipeline_code": "10 years from market placement",
        "design_decision_log": "10 years from market placement",
        "art10_5_processing_records": "10 years from market placement",
    }

    TRANSIENT_DATA = {
        "personal_data_training_records": "Delete after training completion",
        "personal_data_validation_records": "Delete after model selection",
        "personal_data_test_records": "Delete after performance evaluation",
        "special_category_data_debiasing": "Delete after Art.10(5) debiasing process",
        "raw_biometric_training_data": "Delete after feature extraction",
    }

    def classify_data_asset(self, asset_name: str, contains_personal_data: bool) -> dict:
        """Classify a data asset for retention under the dual GDPR/AI Act regime."""
        # Check whether the asset is permanent technical documentation
        for doc_type, retention in self.PERMANENT_DOCUMENTATION.items():
            if doc_type in asset_name.lower():
                return {
                    "asset": asset_name,
                    "retention_regime": "AI Act Art.11 — Technical Documentation",
                    "retention_period": retention,
                    "personal_data_permitted": False,  # Must contain no personal data
                    "action": "Anonymize if currently contains personal data, then retain 10yr"
                }
        # Personal data assets
        if contains_personal_data:
            return {
                "asset": asset_name,
                "retention_regime": "GDPR Art.5(1)(e) — Storage Limitation",
                "retention_period": "Delete after training/validation/testing purpose fulfilled",
                "action": "Schedule deletion after training pipeline completion",
                "gdpr_article": "5(1)(e)"
            }
        return {
            "asset": asset_name,
            "retention_regime": "No personal data — standard business retention",
            "action": "Apply organization's standard data lifecycle policy"
        }
```
EU Data Act × Art.10: IoT Training Data Provenance
The EU Data Act (Regulation 2023/2854, applicable since September 2025) creates a new compliance layer for AI systems trained on IoT-generated data.
EU Data Act Art.4-5 grants users of IoT devices the right to access and share data generated by those devices. Art.5 allows third-party access to that data when contractually authorized. If your high-risk AI training pipeline uses data from connected products (medical sensors, industrial IoT, smart home devices, connected vehicles), the EU Data Act creates provenance obligations that intersect with AI Act Art.10.
The intersection (key scenarios):
| Scenario | EU Data Act Obligation | AI Act Art.10 Obligation | Combined Risk |
|---|---|---|---|
| Training medical AI on patient-connected device data | Art.4 data sharing right (patient can revoke) | Art.10(2)(b) representativeness + Art.10(3) bias | Revocation after training = dataset representativeness change |
| Training employment AI on employee productivity sensor data | Art.5 third-party sharing requires explicit contract | Art.10(2)(a) design choices — collection method | Unauthorized collection = Art.10 violation + Data Act Art.5 breach |
| Training critical infrastructure AI on industrial IoT streams | Art.4-5 user rights for IoT-generated data | Art.10(2)(b)+(d) relevance + representativeness | Data Act B2G access (Art.9-15) may override your Art.10 dataset integrity |
```python
from dataclasses import dataclass


@dataclass
class IoTDataProvenanceRecord:
    """
    Document IoT training data provenance for dual EU Data Act + AI Act Art.10 compliance.
    Required when training data includes data from 'connected products' or 'related services'
    within the scope of EU Data Act Reg. 2023/2854 (applicable Sept 2025).
    """
    device_type: str  # "medical_sensor", "industrial_iot", "smart_home", "connected_vehicle"
    eu_data_act_applicable: bool  # True if device is a "connected product" under Art.2(5) Data Act
    # EU Data Act compliance
    data_access_basis: str  # "user_consent_art4", "third_party_contract_art5", "b2g_art9"
    user_revocation_mechanism: bool  # Users can revoke access (Art.4 Data Act)
    revocation_impact_on_training: str  # What happens to training data if a user revokes
    # AI Act Art.10 integration
    data_act_gdpr_intersection: bool  # True if IoT data also contains personal data
    art10_2a_collection_documented: bool  # Design choice: IoT collection method documented

    def check_compliance(self) -> dict:
        issues = []
        if self.eu_data_act_applicable and not self.user_revocation_mechanism:
            issues.append("EU Data Act Art.4: User revocation mechanism required for connected product data")
        if not self.art10_2a_collection_documented:
            issues.append("AI Act Art.10(2)(a): IoT data collection method must be documented as design choice")
        if self.data_act_gdpr_intersection and self.data_access_basis not in ["user_consent_art4", "third_party_contract_art5"]:
            issues.append("GDPR × EU Data Act: Personal IoT data requires dual lawful basis")
        return {
            "status": "COMPLIANT" if not issues else "ISSUES_FOUND",
            "issues": issues,
            "remediation_required": len(issues) > 0
        }
```
Annex IV Technical Documentation: Art.10 Requirements
Every Art.10 compliance action generates documentation that must appear in the Annex IV Technical Documentation package. Under Art.18, this documentation must be kept for 10 years after market placement and made available to national competent authorities on request.
Minimum Art.10 content for Annex IV:
| Section | Required Content | Art.10 Source |
|---|---|---|
| 2.1 Dataset description | Sources, collection method, size, preprocessing steps | 10(2)(a) |
| 2.2 Representativeness assessment | Statistical analysis of demographic coverage | 10(2)(b) |
| 2.3 Data quality measures | Error rate, completeness analysis, validation methodology | 10(2)(c)/(d) |
| 2.4 Bias examination report | Per-characteristic bias metrics, DIR, equalized odds | 10(3) |
| 2.5 Data gap documentation | Protected characteristics without direct data + proxy analysis | 10(4) |
| 2.6 Special category data processing | If Art.10(5) used: conditions met, deletion record | 10(5) |
| 2.7 Retention policy | GDPR × AI Act dual retention framework | 10(2)(f) × Art.11 |
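The seven sections above lend themselves to an automated pre-submission completeness check. A minimal sketch, with hypothetical section keys and function names (the AI Act prescribes the content, not this schema):

```python
# Hypothetical completeness check for the seven Art.10-derived Annex IV sections.
REQUIRED_SECTIONS = {
    "2.1": "Dataset description (Art.10(2)(a))",
    "2.2": "Representativeness assessment (Art.10(2)(b))",
    "2.3": "Data quality measures (Art.10(2)(c)/(d))",
    "2.4": "Bias examination report (Art.10(3))",
    "2.5": "Data gap documentation (Art.10(4))",
    "2.6": "Special category data processing (Art.10(5))",
    "2.7": "Retention policy (Art.10(2)(f) × Art.11)",
}


def missing_sections(package: dict) -> list:
    """Return the Annex IV sections that are absent or empty in a draft package."""
    return [
        f"{num}: {title}"
        for num, title in REQUIRED_SECTIONS.items()
        if not package.get(num, "").strip()
    ]


# A draft package with only two sections filled in (illustrative content):
draft = {
    "2.1": "Sources: clinical IoT sensors; collection method: ...",
    "2.4": "DIR per protected characteristic: ...",
}
print(missing_sections(draft))  # five sections still to write
```

Wiring such a check into CI for the documentation repository turns "is the Annex IV package complete?" from a pre-audit scramble into a failing build.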
What EU-Native Infrastructure Means for Art.10
Training pipelines on US-hosted infrastructure create a structural Art.10 complication that EU-native deployments avoid.
The CLOUD Act exposure: When training data — including special category data processed under Art.10(5), bias examination records, and data gap documentation — is stored on US-based infrastructure (AWS, GCP, Azure, or their EU regions governed by US parent companies), the US CLOUD Act (18 U.S.C. § 2713) allows US law enforcement to compel production of that data, including data held on EU servers, without MLAT proceedings and without notifying the EU data subject or the EU national authority.
Why this matters for Art.10:
- Art.10(5) special category data processed for debiasing is among the most sensitive data types in the GDPR taxonomy. Processing it on CLOUD Act-exposed infrastructure creates a risk that EU market surveillance authorities (MSAs) would not accept as meeting the "appropriate safeguards" condition.
- Art.10(3) bias examination records may contain aggregated statistics about racial or ethnic composition of training populations. On US infrastructure, these records are CLOUD Act-accessible.
- Art.18's 10-year retention of the Annex IV technical documentation on US infrastructure means 10 years of CLOUD Act exposure for your AI system's compliance documentation.
EU-native training pipeline: When training, validation, testing, and documentation all occur on EU-native infrastructure (German law, no US parent company, GDPR as sole applicable data law), Art.10(5) processing occurs in a single legal regime. The CLOUD Act parallel access risk is structurally absent — you get Art.10 compliance without the jurisdictional complexity.
Conformity Assessment and Art.10
Art.10 compliance is assessed as part of the full conformity assessment for high-risk AI systems:
- Self-assessment (most Annex III categories): The provider conducts and documents Art.10 compliance internally, retaining evidence in the Annex IV package
- Notified Body (specific categories): Biometric RBI systems and Annex I product AI require third-party assessment, which includes Art.10 review
National MSAs (Germany: BNetzA or BaFin, depending on sector; France: CNIL or ANSSI, depending on sector) may request the Annex IV documentation for spot-checks. The absence of a complete Art.10(3) bias examination report for every dataset, including datasets where no bias was found, is the most common reason for conformity assessment failure.
The documentation-first principle: Art.10 compliance is not primarily about eliminating bias — it is about demonstrating that you examined for it. A system with documented, measured, mitigated residual bias scores better in conformity assessment than a system with no documentation at all.
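Following the documentation-first principle, a bias examination entry should be emitted for every dataset and characteristic pair, including null findings. A hypothetical sketch (the field names and the 0.8 disparate-impact threshold, a common four-fifths rule of thumb, are illustrative assumptions, not a prescribed schema):

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class BiasExaminationEntry:
    """One Art.10(3) examination record per dataset × protected characteristic."""
    dataset_id: str
    characteristic: str    # EU Charter Art.21 ground, e.g. "sex", "ethnic origin"
    metric: str            # e.g. "disparate_impact_ratio" (higher = less disparate)
    value: float
    threshold: float       # minimum acceptable metric value
    examined_on: date
    mitigation: str = "none required"

    @property
    def finding(self) -> str:
        # A null finding is still a finding: it must be documented.
        return "no bias detected" if self.value >= self.threshold else "bias detected"


entry = BiasExaminationEntry(
    dataset_id="train-v3",
    characteristic="sex",
    metric="disparate_impact_ratio",
    value=0.94,
    threshold=0.8,  # four-fifths rule of thumb; pick and justify your own threshold
    examined_on=date(2026, 3, 1),
)
print(entry.finding)  # no bias detected, and still recorded in the Annex IV package
```

The point of the sketch is the shape of the record, not the metric: whatever fairness metrics you choose, the entry for a clean dataset looks the same as the entry for a biased one, minus the mitigation field.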
Checklist: Art.10 Implementation for High-Risk AI
Before August 2026:
- Dataset inventory completed (all training, validation, test datasets documented)
- Data source provenance documented per Art.10(2)(a) for all datasets
- Representativeness analysis completed per Art.10(2)(b) with statistical evidence
- Bias examination report generated per Art.10(3) for all EU Charter Art.21 characteristics
- Data gap documentation written per Art.10(4) for all characteristics without direct data
- Proxy variable analysis completed for characteristics without direct data
- If Art.10(5) used: all four conditions documented, deletion schedule confirmed
- GDPR × Art.10 dual retention policy implemented (permanent metadata, transient personal data)
- If IoT training data used: EU Data Act Art.4-5 compliance verified
- Annex IV technical documentation package includes all 7 Art.10 sections
- 10-year documentation retention system operational (no personal data in retained docs)
See Also
- EU AI Act Art.6 High-Risk AI Systems: Developer Guide — Classification routes, Annex III categories, and the full Art.9-15 obligation framework
- EU AI Act Art.9 Risk Management: Formal Verification Developer Guide — The lifecycle-spanning risk management system that Art.10 data governance feeds into
- EU AI Act Art.5 Prohibited Practices: Developer Guide — Six prohibited AI practices applicable since February 2025
- EU Data Act B2B Data Sharing & AI Training Developer Guide — EU Data Act obligations for IoT data that feeds into AI training pipelines
- EU AI Office & GPAI Model Regulation Developer Guide — Art.53 training data transparency for GPAI models (separate from Art.10 for high-risk AI fine-tuning)