Data Minimisation in AI: GDPR Art.5(1)(c) + EU AI Act Art.10 Developer Compliance Guide 2026
Post #4 in the sota.io EU AI Act + GDPR Intersection Series
AI systems are hungry for data. The more training data, the better the model — or so the conventional wisdom goes. But GDPR's data minimisation principle sits in direct tension with that instinct: personal data must be "adequate, relevant and limited to what is necessary" for its purpose.
This tension is now legally consequential. With the EU AI Act's prohibited-practice rules applicable since 2 February 2025 and the August 2, 2026 deadline for high-risk AI obligations approaching, developers need to understand where GDPR Art.5(1)(c) intersects with the data governance requirements of EU AI Act Art.10, and what that means for training datasets, inference pipelines, and production AI systems.
1. GDPR Art.5(1)(c): Data Minimisation Explained for AI
Article 5(1)(c) of the GDPR states that personal data shall be:
"adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed"
This is the data minimisation principle. It applies from the moment you collect data through to deletion. For AI developers, there are three distinct processing contexts where it applies:
| Context | Minimisation Obligation |
|---|---|
| Training data collection | Only collect personal data attributes actually needed to train the model for its stated purpose |
| Training data processing | Strip, anonymise, or pseudonymise attributes not required for the model objective |
| Inference (production) | Do not process more user data per request than required to produce the output |
The Purpose Limitation Link
Data minimisation cannot be evaluated in isolation — it is locked to the processing purpose established under Art.5(1)(b). Before you can assess whether data is "necessary," you must have a clearly documented purpose.
For AI systems, this means:
- Training purpose: "Build a credit risk model using applicant financial history" → Only financial history attributes needed, not browsing history, social media data, or family information
- Inference purpose: "Generate a summary of this support ticket" → Only the ticket content, not the full customer profile
A common violation pattern: developers ingest entire database snapshots for convenience, train on columns they didn't plan to use, and never prune the training dataset. Under GDPR, this is an Art.5(1)(c) breach even if the final model doesn't "expose" the excess data directly.
Fines Under Art.83(5)
Violations of Art.5 principles — including data minimisation — fall under Article 83(5) of the GDPR, the higher fine tier:
- Up to €20,000,000 or 4% of total worldwide annual turnover of the preceding financial year, whichever is higher.
The CNIL's May 2023 fine of €380,000 against Doctissimo (a French health website) cited, among other breaches, retention of personal data for longer than necessary under Art.5(1)(e). DPAs are beginning to apply the same scrutiny to AI training datasets.
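The "whichever is higher" rule is easy to misread, so here is a minimal sketch of the arithmetic. The function name and the turnover figures are illustrative assumptions; the result is only the statutory ceiling, not a prediction of any actual fine.

```python
def art83_5_fine_ceiling(worldwide_annual_turnover_eur: float) -> float:
    """Upper bound of an Art.83(5) fine: the higher of EUR 20m or 4% of turnover."""
    fixed_cap_eur = 20_000_000
    turnover_cap_eur = worldwide_annual_turnover_eur * 4 / 100
    return max(fixed_cap_eur, turnover_cap_eur)


# Illustrative turnovers, not real companies.
for turnover in (50_000_000, 500_000_000, 2_000_000_000):
    ceiling = art83_5_fine_ceiling(turnover)
    print(f"turnover EUR {turnover:,} -> ceiling EUR {ceiling:,.0f}")
```

The 4% branch only overtakes the fixed amount once turnover exceeds €500 million, so for smaller companies the €20 million figure is the relevant ceiling.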
2. EU AI Act Art.10: Data Governance Requirements
Article 10 of the EU AI Act applies to high-risk AI systems (as defined in Annex III, operative from August 2026). It establishes data governance obligations for training, validation, and testing datasets.
Art.10(2) requires that training data practices address:
- Relevant design choices
- Data collection processes
- Data preparation operations including cleaning, annotation, labelling, enrichment, and aggregation
- Formulation of relevant assumptions about the data
- Examination of the availability, quantity, and suitability of the datasets
Art.10(3) requires that training, validation, and testing datasets be:
- Relevant for the intended purpose
- Sufficiently representative (covering domain-relevant subgroups)
- Free of errors to the best extent possible
- Complete with respect to the characteristics required for the system's purpose
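Two of these properties, completeness and representativeness, can be checked mechanically before training. The sketch below is one possible approach using only the standard library; the field names, the sample rows, and the 30% minimum share are assumptions for illustration, since Art.10 does not prescribe numeric thresholds.

```python
from collections import Counter


def completeness_rate(rows: list[dict], required_fields: list[str]) -> float:
    """Share of rows in which every required field is present and not None."""
    if not rows:
        return 0.0
    complete = sum(
        all(row.get(field) is not None for field in required_fields)
        for row in rows
    )
    return complete / len(rows)


def underrepresented_groups(rows: list[dict], group_field: str, min_share: float) -> dict:
    """Return each group whose share of the dataset falls below min_share."""
    counts = Counter(
        row[group_field] for row in rows if row.get(group_field) is not None
    )
    total = sum(counts.values())
    return {g: n / total for g, n in counts.items() if n / total < min_share}


# Hypothetical sample rows.
rows = [
    {"income_bracket": "B", "age_band": "18-30"},
    {"income_bracket": "C", "age_band": "31-50"},
    {"income_bracket": None, "age_band": "31-50"},
    {"income_bracket": "A", "age_band": "31-50"},
]
print(completeness_rate(rows, ["income_bracket"]))
print(underrepresented_groups(rows, "age_band", min_share=0.3))
```

Recording the outputs of checks like these alongside the dataset version gives the data governance file something concrete to point to.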
Art.10(5) addresses special categories of personal data (such as health, biometric, or racial or ethnic origin data): providers may process them only where strictly necessary for detecting and correcting bias, and subject to appropriate safeguards.
Where Art.10 and Art.5(1)(c) Align
Both requirements share a common core: fitness for purpose. Art.10(3)'s "relevant" and "sufficiently representative" framing maps closely onto Art.5(1)(c)'s "adequate" and "relevant."
The practical effect: a dataset that passes GDPR minimisation review will generally also satisfy the relevance and suitability tests under Art.10. The converse also holds: an overly broad dataset that fails the minimisation test will struggle to demonstrate that every attribute is "relevant" to the intended purpose under Art.10(3).
Where They Diverge
The EU AI Act introduces a representativeness requirement that GDPR does not. Art.10(3) requires datasets to cover relevant subgroups; too little data about a protected characteristic can create biased models — but that very characteristic may also be data you should minimise under GDPR.
This is the core tension developers must navigate:
| Concern | GDPR Direction | EU AI Act Direction |
|---|---|---|
| Gender data in training set | Minimise unless strictly necessary | Include for fairness/bias detection |
| Ethnicity/race data | Sensitive category — restrict to Art.9 basis | May be needed for representativeness |
| Age data | Limit to what the model actually uses | Required for age-sensitive applications |
Resolution: The EU AI Act Art.10(5) carveout is deliberately narrow. Special category data is allowed only for bias detection and correction, not for general training. Document this basis explicitly and treat it as a data protection impact assessment (DPIA) trigger.
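One way to respect both directions is to keep the special category attribute out of the training features and hold it in a separate audit table, joined on a pseudonymous record ID only when a bias metric is computed. The sketch below assumes that design; the identifiers, group labels, and the choice of a selection-rate gap as the metric are all illustrative assumptions.

```python
from collections import defaultdict


def selection_rates(predictions: dict[str, int], audit_groups: dict[str, str]) -> dict[str, float]:
    """Positive-prediction rate per group, joined on a pseudonymous record ID."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for record_id, predicted in predictions.items():
        group = audit_groups.get(record_id)
        if group is None:
            continue  # no audit attribute is held for this record
        totals[group] += 1
        positives[group] += predicted
    return {group: positives[group] / totals[group] for group in totals}


def parity_gap(rates: dict[str, float]) -> float:
    """Difference between the highest and lowest group selection rate."""
    return max(rates.values()) - min(rates.values()) if rates else 0.0


# Model outputs (1 = positive decision) keyed by pseudonymous ID.
predictions = {"r1": 1, "r2": 0, "r3": 1, "r4": 1}
# Held separately from the training features, used only for this audit.
audit_groups = {"r1": "A", "r2": "A", "r3": "B", "r4": "B"}

rates = selection_rates(predictions, audit_groups)
print(rates, parity_gap(rates))
```

Because the model never sees `audit_groups`, the audit table can be given its own access controls and deletion date.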
3. Training Data Minimisation: Practical Patterns
3.1 Feature Selection as a Compliance Tool
The most direct implementation of Art.5(1)(c) is treating feature selection as a compliance step, not just a modelling step.
Before training:
import pandas as pd

# Define which features are purpose-necessary
REQUIRED_FEATURES = {
    "loan_default_prediction": [
        "loan_amount", "loan_term_months", "income_bracket",
        "employment_type", "debt_to_income_ratio", "credit_score_band"
    ]
    # NOT included: postcode, age, marital_status, number_of_dependants
    # unless purpose explicitly requires them
}

def load_training_data(raw_df: pd.DataFrame, purpose: str) -> pd.DataFrame:
    """Return only the columns documented as necessary for `purpose`."""
    required = REQUIRED_FEATURES[purpose]  # KeyError for an undocumented purpose
    df = raw_df[required].copy()
    # Document what was excluded and why
    # (log_minimisation_decision is the application's own audit helper)
    log_minimisation_decision(
        purpose=purpose,
        included=required,
        excluded=[c for c in raw_df.columns if c not in required],
        legal_basis="Article 5(1)(c) GDPR: limited to what is necessary for stated purpose"
    )
    return df
This audit log is directly relevant to DPIA documentation under GDPR Art.35 and EU AI Act technical documentation requirements.
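The snippet above calls `log_minimisation_decision` without defining it. A minimal sketch of what such a helper could look like follows, assuming an append-only JSON Lines file is an acceptable audit store; the file name, record layout, and the `build_minimisation_record` helper are hypothetical.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def build_minimisation_record(purpose: str, included: list[str], excluded: list[str],
                              legal_basis: str, timestamp: Optional[str] = None) -> dict:
    """Assemble one audit record; column lists are sorted for stable diffs."""
    return {
        "timestamp": timestamp or datetime.now(timezone.utc).isoformat(),
        "purpose": purpose,
        "included": sorted(included),
        "excluded": sorted(excluded),
        "legal_basis": legal_basis,
    }


def log_minimisation_decision(purpose: str, included: list[str], excluded: list[str],
                              legal_basis: str,
                              log_path: str = "minimisation_audit.jsonl") -> dict:
    """Append the record to a JSON Lines file (path is an assumed default)."""
    record = build_minimisation_record(purpose, included, excluded, legal_basis)
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record


# Build a record without touching the filesystem, to show its shape.
example = build_minimisation_record(
    purpose="loan_default_prediction",
    included=["loan_amount", "income_bracket"],
    excluded=["postcode", "marital_status"],
    legal_basis="Article 5(1)(c) GDPR",
    timestamp="2026-01-01T00:00:00+00:00",
)
print(json.dumps(example, indent=2))
```

Only column names are logged, never values, so the audit trail itself does not become a new store of personal data.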
3.2 Pseudonymisation Before Training
Where individual identity is not required for model training (which covers most supervised learning tasks), pseudonymise before the data enters the training pipeline.
import hashlib
import os
import pandas as pd

def pseudonymise_training_dataset(df: pd.DataFrame, pii_columns: list[str], salt: str) -> pd.DataFrame:
    """
    Replace PII identifiers with stable pseudonyms.
    Salt should be stored separately (not in training data or model artifacts).
    """
    df = df.copy()
    for col in pii_columns:
        # Salted SHA-256, truncated to 16 hex characters; missing values stay missing
        df[col] = df[col].apply(
            lambda v: hashlib.sha256(f"{salt}{v}".encode()).hexdigest()[:16]
            if pd.notna(v) else None
        )
    return df

# Apply before any training split
df_train = pseudonymise_training_dataset(
    df=raw_training_data,
    pii_columns=["user_id", "email", "ip_address"],
    salt=os.environ["PSEUDONYM_SALT"]
)
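Under GDPR Art.4(5) and Recital 26, pseudonymised data is still personal data, because it can be attributed to a person using additional information. Keeping that additional information (here, the salt) separate and protected nevertheless reduces the effective risk level and counts as a safeguard.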
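A variant worth considering is a keyed hash. The sketch below swaps the salted SHA-256 used in the function above for HMAC-SHA256 from the standard library, which is designed for use with a secret key. The function name and the example key are assumptions; in practice the key would come from a secrets manager and not from source code.

```python
import hashlib
import hmac


def pseudonymise_value(value: str, secret_key: bytes) -> str:
    """Keyed pseudonym via HMAC-SHA256; stable for the same key and value."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Example key for illustration only; never hard-code a real key.
key = b"example-key-held-outside-the-training-environment"
token_a = pseudonymise_value("jane.doe@example.com", key)
token_b = pseudonymise_value("jane.doe@example.com", key)
token_c = pseudonymise_value("jane.doe@example.com", b"a-different-key")

print(token_a == token_b)  # same key and value give the same pseudonym
print(token_a == token_c)  # a different key gives an unrelated pseudonym
```

Stable pseudonyms keep joins across tables working, while rotating or destroying the key severs the link back to the original identifiers.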
3.3 Differential Privacy for Statistical Noise
Where the training dataset contains sensitive attributes that cannot be fully removed, differential privacy (DP) adds calibrated noise during training that provides mathematical privacy guarantees.
For PyTorch with Opacus (open source, EU-data-centre deployable):
import torch
from opacus import PrivacyEngine

model = MyModel()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
privacy_engine = PrivacyEngine()
model, optimizer, data_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader,
    noise_multiplier=1.1,  # more noise gives a smaller ε (stronger privacy) but lower accuracy
    max_grad_norm=1.0,     # per-sample gradient clipping bound
)
# ... run the training loop with the wrapped model, optimizer, and data_loader ...

# After training, document the privacy budget spent
delta = 1e-5
epsilon = privacy_engine.get_epsilon(delta=delta)  # returns ε as a float for the chosen δ
print(f"Training DP budget: ε={epsilon:.2f}, δ={delta}")
# Record epsilon/delta in the EU AI Act technical documentation
Recording the DP budget in the EU AI Act technical file gives auditors concrete evidence that privacy protection was engineered into the training process rather than asserted after the fact.
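To make clear what `noise_multiplier` and `max_grad_norm` control, here is a simplified, standard-library sketch of one DP-SGD aggregation step: clip each per-sample gradient, sum, add Gaussian noise, and average. It is an illustration of the mechanism under those assumptions, not Opacus's implementation, and the function names are hypothetical.

```python
import math
import random


def clip_gradient(grad: list[float], max_grad_norm: float) -> list[float]:
    """Scale the gradient down so its L2 norm is at most max_grad_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_grad_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_average_gradient(per_sample_grads: list[list[float]], max_grad_norm: float,
                        noise_multiplier: float, rng: random.Random) -> list[float]:
    """Clip per-sample gradients, sum them, add Gaussian noise, then average."""
    clipped = [clip_gradient(g, max_grad_norm) for g in per_sample_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_grad_norm  # noise standard deviation
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [n / len(clipped) for n in noisy]


grads = [[3.0, 4.0], [0.3, 0.4]]  # two per-sample gradients
rng = random.Random(0)
print(dp_average_gradient(grads, max_grad_norm=1.0, noise_multiplier=1.1, rng=rng))
```

Clipping bounds how much any single record can move the model, and the noise is scaled to that bound, which is what allows a privacy budget to be computed at all.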
3.4 Retention Limits for Training Datasets
Data minimisation applies not just to which data is included but to how long it is retained. GDPR Art.5(1)(e) (storage limitation) reinforces this.
Define explicit retention schedules for:
- Raw collected data: Delete once training splits are generated
- Training/validation/test splits: Delete or archive at end of model lifecycle
- Experiment logs: Retain metadata, delete sample rows
- Model checkpoints: Each checkpoint may embed training data statistics; treat it as personal data if re-identification is reasonably likely (the Recital 26 test)
import logging
import shutil

# Training pipeline cleanup hook
def cleanup_training_artifacts(run_id: str, retain_final_model: bool = True):
    paths = {
        "raw_data": f"data/raw/{run_id}/",
        "intermediate_splits": f"data/processed/{run_id}/",
        "checkpoints": f"models/{run_id}/checkpoints/",
    }
    for name, path in paths.items():
        if name == "checkpoints" and retain_final_model:
            # Keep only the final checkpoint, delete intermediates
            # (delete_all_except_latest is the pipeline's own helper)
            delete_all_except_latest(path)
            logging.info(f"Pruned intermediate {name} at {path}: retention limit reached")
        else:
            shutil.rmtree(path, ignore_errors=True)
            logging.info(f"Deleted {name} at {path}: retention limit reached")
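The retention schedule itself can be expressed as data and checked on a timer, so that cleanup is triggered by the schedule, not by memory. The sketch below is one possible shape; the artifact names, the retention periods, and the dates are illustrative assumptions, not recommended values.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class RetentionRule:
    artifact: str
    max_age_days: int


def expired_artifacts(created_on: dict[str, date], rules: list[RetentionRule],
                      today: date) -> list[str]:
    """Return the artifacts whose age exceeds their documented retention limit."""
    limits = {rule.artifact: timedelta(days=rule.max_age_days) for rule in rules}
    return sorted(
        name for name, created in created_on.items()
        if name in limits and today - created > limits[name]
    )


# Hypothetical schedule and creation dates.
rules = [RetentionRule("raw_data", 30), RetentionRule("intermediate_splits", 365)]
created_on = {
    "raw_data": date(2026, 1, 1),
    "intermediate_splits": date(2026, 1, 1),
}
print(expired_artifacts(created_on, rules, today=date(2026, 3, 1)))
```

Artifacts with no rule are ignored here; a stricter variant could treat a missing rule as an error so that nothing is stored without a documented limit.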
4. Inference-Time Minimisation
Training data compliance is only half the equation. Production AI systems process personal data at inference time on every API call.
4.1 Input Scope: What Does the Model Actually Need?
Before building the input payload for a model call, apply the same "necessary for purpose" test:
import logging

audit_log = logging.getLogger("inference_audit")

def build_model_input(user_context: dict, purpose: str) -> dict:
    """
    Build a minimised input payload: only fields the model needs for this purpose.
    """
    INFERENCE_FEATURE_MAP = {
        "sentiment_analysis": ["message_text"],
        "risk_scoring": ["transaction_amount", "merchant_category", "time_of_day"],
        "document_summary": ["document_text"],
    }
    # An unknown purpose maps to an empty allow-list, so nothing is passed through
    allowed = INFERENCE_FEATURE_MAP.get(purpose, [])
    minimised = {k: v for k, v in user_context.items() if k in allowed}
    # Log field names only, never field values
    audit_log.info({
        "event": "inference_input_minimised",
        "purpose": purpose,
        "fields_included": list(minimised.keys()),
        "fields_excluded": [k for k in user_context if k not in allowed],
    })
    return minimised
4.2 LLM Context Windows: The Hidden PII Problem
For applications using large language models, the context window is the primary minimisation risk. Developers routinely pass entire user histories, full database records, or complete document stores as context — when only a fraction is relevant to the query.
The pattern to avoid:
# PROBLEMATIC: passes entire user profile to LLM
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": f"Customer: {json.dumps(full_customer_record)}\n\nQuestion: {user_query}"}
]
The pattern to adopt:
# COMPLIANT: retrieve-then-minimise before LLM call
def build_rag_context(user_query: str, user_id: str) -> str:
    # Step 1: Retrieve relevant chunks only
    relevant_chunks = vector_store.similarity_search(
        query=user_query,
        filter={"user_id": user_id},
        k=3  # top 3 relevant chunks, not full history
    )
    # Step 2: Exclude chunks with sensitive categories not relevant to query
    filtered_chunks = [c for c in relevant_chunks if is_purpose_relevant(c, user_query)]
    return "\n\n".join([c.page_content for c in filtered_chunks])
This pattern also reduces hallucination risk by reducing noise — compliance and quality align here.
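Retrieved chunks can still carry direct identifiers that the query does not need. As a further hedge, a redaction pass can run on each chunk before it reaches the prompt. The sketch below uses two simple regular expressions; the patterns and placeholder tokens are assumptions, and pattern-based redaction will miss names, addresses, and many other identifiers, so it complements retrieval filtering without replacing it.

```python
import re

# Deliberately simple patterns for illustration; real data needs broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_identifiers(text: str) -> str:
    """Replace email addresses and phone-like number runs with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


chunk = "Contact jane.doe@example.com or +49 30 1234567 about order 42."
print(redact_identifiers(chunk))
```

Short numbers such as the order reference are left alone because the phone pattern requires at least nine characters of digits and separators.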
4.3 Response Logging: Do Not Persist What You Did Not Need
LLM inference platforms often log full request/response pairs by default. This creates a secondary dataset of personal data that may be significantly broader than what the application purpose requires.
Implement selective logging:
from datetime import datetime, timezone

def log_inference_result(request_id: str, input_fields: list[str], output: str):
    """Log only non-personal metadata about the inference, not the content."""
    log_entry = {
        "request_id": request_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "input_field_count": len(input_fields),
        # Whitespace-split word count, used as a rough proxy for token count
        "output_token_count": len(output.split()),
        "model_version": MODEL_VERSION,
        # Do NOT log: input content, output content, user identifiers
    }
    metrics_store.write(log_entry)
Full content logging requires its own legal basis and retention schedule — it is not free to maintain as a side effect of inference.
5. DPIA Trigger: When AI + Minimisation = Mandatory Assessment
Under GDPR Art.35, a DPIA is required where processing is "likely to result in a high risk to the rights and freedoms of natural persons." The Article 29 Working Party's DPIA guidelines (WP248 rev.01), endorsed by the European Data Protection Board (EDPB), list nine criteria. As a rule of thumb, processing that meets two or more of them will in most cases require a DPIA:
1. Evaluation or scoring: AI systems that score individuals
2. Automated decision-making with legal or similarly significant effect (Art.22)
3. Systematic monitoring: surveillance or tracking
4. Sensitive data processing: Art.9 categories
5. Data processed on a large scale
6. Matching or combining datasets: multiple sources merged
7. Data on vulnerable individuals: children, employees
8. Innovative use of technology: AI as novel technology
9. Processing that prevents data subjects from exercising a right or using a service or contract
High-risk AI systems under the EU AI Act will almost always meet at least criteria 1, 5, and 8, so in practice a DPIA is required. The minimisation documentation from your training pipeline becomes a DPIA input, not a separate exercise.
Connection to EU AI Act Art.10: the technical documentation required under Art.11 and Annex IV, which covers the Art.10 data governance measures, should include your minimisation log, feature selection rationale, and retention schedule. This means the DPIA and the AI Act technical file can be co-authored.
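The two-criteria rule of thumb described above can be turned into a simple screening helper for project intake. The sketch below is an assumption about how a team might encode it; the criterion identifiers are hypothetical labels for the nine criteria, and the output is a prompt for legal review, not a legal determination.

```python
DPIA_CRITERIA = (
    "evaluation_or_scoring",
    "automated_decision_legal_effect",
    "systematic_monitoring",
    "sensitive_data",
    "large_scale",
    "dataset_matching",
    "vulnerable_subjects",
    "innovative_technology",
    "blocks_right_or_service",
)


def dpia_screening(criteria_met: set[str]) -> dict:
    """Count the criteria met and flag when the two-criteria rule of thumb applies."""
    unknown = criteria_met - set(DPIA_CRITERIA)
    if unknown:
        raise ValueError(f"Unknown criteria: {sorted(unknown)}")
    return {
        "criteria_met": sorted(criteria_met),
        "count": len(criteria_met),
        "dpia_likely_required": len(criteria_met) >= 2,
    }


# A typical high-risk AI profile: scoring, large scale, novel technology.
result = dpia_screening({"evaluation_or_scoring", "large_scale", "innovative_technology"})
print(result)
```

Storing the screening result with the project record documents why a DPIA was, or was not, started.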
6. Special Category Data in AI Training (Art.10(5) + GDPR Art.9)
When your AI training dataset includes special category data (health, race, religion, biometric, genetic, sexual orientation data under GDPR Art.9), you need to satisfy two independent conditions:
- GDPR Art.9(2): explicit consent, substantial public interest under Art.9(2)(g) (rare), research purposes (Art.9(2)(j) with Art.89 safeguards), or another listed ground
- EU AI Act Art.10(5): bias detection and correction purpose only
This is the narrowest possible scope. If you are including racial or ethnic data "to improve model accuracy generally," that does not meet Art.10(5). The purpose must be explicit bias detection and correction.
Documentation in your technical file should include:
- Which Art.9 ground was invoked
- What bias was being detected/corrected
- How the special category data was isolated (not merged into general training features)
- When the special category data will be deleted
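The four documentation items above map naturally onto a structured record that can be validated before a training run is approved. The sketch below is one possible shape; the class name, field names, and example values are hypothetical.

```python
from dataclasses import dataclass, fields
from datetime import date


@dataclass(frozen=True)
class SpecialCategoryRecord:
    data_category: str       # e.g. "ethnic origin"
    gdpr_art9_ground: str    # which Art.9(2) ground was invoked
    bias_under_test: str     # what bias is being detected or corrected
    isolation_measure: str   # how the data is kept out of general training features
    deletion_date: date      # when the special category data will be deleted

    def missing_fields(self) -> list[str]:
        """Names of text fields left blank, which make the record incomplete."""
        return [
            f.name for f in fields(self)
            if isinstance(getattr(self, f.name), str)
            and not getattr(self, f.name).strip()
        ]


record = SpecialCategoryRecord(
    data_category="ethnic origin",
    gdpr_art9_ground="Art.9(2)(a) explicit consent",
    bias_under_test="approval-rate gap between groups",
    isolation_measure="",  # left blank on purpose to show the check
    deletion_date=date(2026, 12, 31),
)
print(record.missing_fields())
```

A pipeline gate could refuse to load the audit table while `missing_fields()` returns anything.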
7. Developer Checklist: Data Minimisation Compliance
Training Phase
- Feature inventory: Document every column in the training dataset and its purpose-necessity rationale
- Exclusion log: Record which features were considered and excluded, with reasoning (Art.5(1)(c) justification)
- Pseudonymisation applied: PII identifiers replaced before training splits are generated
- Special category audit: Art.9 categories identified; Art.10(5) / Art.9(2) basis documented if present
- Retention schedule: Raw data, processed splits, and checkpoints have documented deletion dates
- DP budget (if applicable): Differential privacy ε/δ values recorded in technical file
Inference Phase
- Input scope mapping: For each API endpoint, documented list of required input fields
- Context minimisation (for LLM applications): RAG retrieval retrieves relevant chunks only, not full histories
- Response logging scope: Logs capture metadata, not personal content, unless separate legal basis exists
- Request-level audit trail: Each inference logs which input fields were processed (for DPA audits)
Documentation
- DPIA triggered or waived: If high-risk AI under Annex III — DPIA is mandatory, include minimisation evidence
- EU AI Act Art.10 technical file section: Feature selection rationale, dataset suitability documentation
- Retention policy published: Internal policy document covering all dataset lifecycle stages
- Review cadence: Annual review scheduled, plus an ad hoc review whenever the model purpose changes
8. August 2026: What Happens Next
The EU AI Act obligations for high-risk AI providers under Chapter III become fully enforceable on August 2, 2026. National Competent Authorities (NCAs) gain powers to request technical documentation at that point.
For Art.10, NCAs are specifically empowered to review:
- The data governance practices used to prepare training datasets
- Whether data quality criteria were met
- Whether special category data was handled under Art.10(5) conditions
Developers who can demonstrate a documented minimisation process — feature selection logs, pseudonymisation steps, retention schedules — will be substantially better positioned for NCA inspection than those who relied on "we only used what we needed" as an undocumented assertion.
The DPIA documentation from GDPR compliance and the technical file under the EU AI Act draw on largely the same evidence in different regulatory framings. Build one artifact that satisfies both.
Conclusion
Data minimisation under GDPR Art.5(1)(c) is not an obstacle to building AI systems; it is a design constraint that can improve them. Feature-selective training, pseudonymisation, differential privacy, and scoped inference inputs tend to reduce overfitting risk and make systems more defensible under audit.
The EU AI Act Art.10 requirements do not conflict with GDPR minimisation — they reinforce it, adding representativeness and data governance documentation requirements that give the minimisation principle operational teeth.
With August 2026 approaching, the time to build these practices into your training pipelines is now: retrofitting minimisation after a model is in production is significantly harder than designing for it from the beginning.
Related posts in this series:
- Post #1: Art.22 Automated Decision-Making + EU AI Act High-Risk Classification
- Post #2: DPIA for High-Risk AI — Art.35 GDPR + Art.9 EU AI Act Risk Management
- Post #3: Art.17 Right to Erasure: LLM Training Data Removal & RAG Vector Store Deletion
- Post #5 (coming next): AI + GDPR Full Compliance Stack: DPO Role, Accountability, Audit Trail Finale