2026-06-06·5 min read·sota.io Team

Data Minimisation in AI: GDPR Art.5(1)(c) + EU AI Act Art.10 Developer Compliance Guide 2026

Post #4 in the sota.io EU AI Act + GDPR Intersection Series


AI systems are hungry for data. The more training data, the better the model — or so the conventional wisdom goes. But GDPR's data minimisation principle sits in direct tension with that instinct: personal data must be "adequate, relevant and limited to what is necessary" for its purpose.

This tension is now legally consequential. The EU AI Act's rules on prohibited practices have applied since 2 February 2025, and the August 2, 2026 deadline for high-risk AI obligations is approaching. Developers need to understand where GDPR Art.5(1)(c) intersects with the EU AI Act Art.10 data governance requirements, and what that means for training datasets, inference pipelines, and production AI systems.

1. GDPR Art.5(1)(c): The Data Minimisation Principle

Article 5(1)(c) of the GDPR states that personal data shall be:

"adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed"

This is the data minimisation principle. It applies from the moment you collect data through to deletion. For AI developers, there are three distinct processing contexts where it applies:

Context	Minimisation Obligation
Training data collection	Only collect personal data attributes actually needed to train the model for its stated purpose
Training data processing	Strip, anonymise, or pseudonymise attributes not required for the model objective
Inference (production)	Do not process more user data per request than required to produce the output

The Purpose Limitation Link

Data minimisation cannot be evaluated in isolation — it is locked to the processing purpose established under Art.5(1)(b). Before you can assess whether data is "necessary," you must have a clearly documented purpose.

For AI systems, this means:

Training purpose: "Build a credit risk model using applicant financial history" → Only financial history attributes needed, not browsing history, social media data, or family information
Inference purpose: "Generate a summary of this support ticket" → Only the ticket content, not the full customer profile

A common violation pattern: developers ingest entire database snapshots for convenience, train on columns they didn't plan to use, and never prune the training dataset. Under GDPR, this is an Art.5(1)(c) breach even if the final model doesn't "expose" the excess data directly.
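
The snapshot problem above can be caught mechanically before training starts. The following is a minimal sketch, assuming you keep a purpose register that maps each documented purpose to the attributes judged necessary for it. `PURPOSE_REGISTER`, `find_excess_columns`, and all column names are hypothetical and exist only for illustration.

```python
# Hypothetical purpose register: documented purpose -> attributes judged necessary.
PURPOSE_REGISTER = {
    "credit_risk_model": {"income", "existing_debt", "repayment_history"},
}


def find_excess_columns(snapshot_columns, purpose):
    """Return snapshot columns with no documented necessity for `purpose`."""
    if purpose not in PURPOSE_REGISTER:
        # Fail closed: without a documented purpose, nothing can be "necessary".
        raise KeyError(f"No documented purpose named {purpose!r}")
    necessary = PURPOSE_REGISTER[purpose]
    return sorted(set(snapshot_columns) - necessary)


# Illustrative snapshot that contains more than the purpose needs.
snapshot = [
    "income", "existing_debt", "repayment_history",
    "browsing_history", "marital_status",
]
excess = find_excess_columns(snapshot, "credit_risk_model")
print(excess)  # columns to drop before the data enters the training pipeline
```

Raising on an unknown purpose is a deliberate choice in this sketch: it forces the purpose to be documented before any data is loaded.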

Fines Under Art.83(5)

Violations of Art.5 principles — including data minimisation — fall under Article 83(5) of the GDPR, the higher fine tier:

Up to €20,000,000 or 4% of total worldwide annual turnover of the preceding financial year, whichever is higher.
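
As a worked illustration of the "whichever is higher" rule, the sketch below computes the upper bound for a given turnover. The function name is hypothetical, and the result is only the statutory ceiling, not a prediction of any actual fine.

```python
def max_fine_art_83_5(annual_turnover_eur):
    """Upper bound of an Art.83(5) fine for an undertaking, in euros."""
    fixed_cap_eur = 20_000_000
    turnover_cap_eur = annual_turnover_eur * 4 / 100  # 4% of worldwide turnover
    return max(fixed_cap_eur, turnover_cap_eur)


# EUR 100M turnover: 4% is EUR 4M, so the EUR 20M fixed cap is the higher value.
small = max_fine_art_83_5(100_000_000)

# EUR 2B turnover: 4% is EUR 80M, which exceeds the fixed cap.
large = max_fine_art_83_5(2_000_000_000)

print(small, large)
```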

The CNIL's May 2023 €380,000 fine against Doctissimo (French health portal) explicitly cited Art.5 violations, including excessive data retention. DPAs are beginning to apply the same scrutiny to AI training datasets.

2. EU AI Act Art.10: Data Governance Requirements

Article 10 of the EU AI Act applies to high-risk AI systems (as defined in Annex III, operative from August 2026). It establishes data governance obligations for training, validation, and testing datasets.

Art.10(2) requires that training data practices address:

Relevant design choices
Data collection processes
Data preparation operations including cleaning, annotation, labelling, enrichment, and aggregation
Formulation of relevant assumptions about the data
Examination of the availability, quantity, and suitability of the datasets

Art.10(3) requires that training, validation, and testing datasets be:

Relevant for the intended purpose
Sufficiently representative (covering domain-relevant subgroups)
Free of errors to the best extent possible
Complete with respect to the characteristics required for the system's purpose
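
Two of these qualities, representativeness and completeness, lend themselves to simple automated checks. The sketch below is one possible approach, assuming records are plain dictionaries and that you have reference population shares to compare against. The function names, the `tolerance` default, and the sample data are all illustrative assumptions, not thresholds taken from the Act.

```python
from collections import Counter


def subgroup_shares(records, key):
    """Share of records per subgroup value (records must be non-empty)."""
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def underrepresented_groups(records, key, reference_shares, tolerance=0.10):
    """Groups whose dataset share falls more than `tolerance` below the reference."""
    observed = subgroup_shares(records, key)
    return sorted(
        group for group, ref in reference_shares.items()
        if observed.get(group, 0.0) < ref - tolerance
    )


def missing_rate(records, field):
    """Fraction of records where `field` is absent or None."""
    return sum(1 for r in records if r.get(field) is None) / len(records)


# Illustrative data: 8 records in one age band, 2 in another, none in the third.
records = (
    [{"age_band": "18-40", "income": 30_000}] * 7
    + [{"age_band": "18-40", "income": None}]
    + [{"age_band": "41-65", "income": 45_000}] * 2
)
reference = {"18-40": 0.50, "41-65": 0.35, "65+": 0.15}

flagged = underrepresented_groups(records, "age_band", reference)
income_missing = missing_rate(records, "income")
print(flagged, income_missing)
```

The output of checks like these is the kind of evidence that can be attached to the dataset suitability section of the technical file.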

Art.10(5) addresses special categories of personal data (such as health, biometric, or racial or ethnic origin data): providers may process these only where strictly necessary for bias detection and correction, and only with appropriate safeguards.

Where Art.10 and Art.5(1)(c) Align

Both requirements share a common core: fitness for purpose. Art.10(3)'s "relevant" and "sufficient" framing maps directly onto Art.5(1)(c)'s "adequate" and "relevant."

The practical effect: a dataset that passes GDPR minimisation review will generally also satisfy the relevance and suitability tests under Art.10. The failure mode is shared too: an overly broad dataset that fails the minimisation test will also struggle to show that every attribute is relevant to the AI system's intended purpose.

Where They Diverge

The EU AI Act introduces a representativeness requirement that GDPR does not. Art.10(3) requires datasets to cover relevant subgroups; too little data about a protected characteristic can create biased models — but that very characteristic may also be data you should minimise under GDPR.

This is the core tension developers must navigate:

Concern	GDPR Direction	EU AI Act Direction
Gender data in training set	Minimise unless strictly necessary	Include for fairness/bias detection
Ethnicity/race data	Sensitive category — restrict to Art.9 basis	May be needed for representativeness
Age data	Limit to what the model actually uses	Required for age-sensitive applications

Resolution: The EU AI Act Art.10(5) carve-out is deliberately narrow. Special category data may be processed only for bias detection and correction, not for general training. Document this basis explicitly and treat it as a data protection impact assessment (DPIA) trigger.
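
One way to respect that narrow scope is to keep the protected attribute out of the feature table entirely and hold it in a separate lookup that is used only for a fairness audit. The sketch below assumes predictions keyed by pseudonym and a separately stored `protected_lookup`. Both structures and both function names are hypothetical, and demographic parity is just one of several possible bias metrics.

```python
def selection_rates_by_group(predictions, protected_lookup):
    """Positive-outcome rate per group.

    predictions: {pseudonym: 0 or 1}, produced by a model that never saw the attribute.
    protected_lookup: {pseudonym: group}, stored separately with restricted access.
    """
    totals, positives = {}, {}
    for pseudonym, outcome in predictions.items():
        group = protected_lookup.get(pseudonym)
        if group is None:
            continue  # no attribute held for this person, so skip them in the audit
        totals[group] = totals.get(group, 0) + 1
        positives[group] = positives.get(group, 0) + outcome
    return {group: positives[group] / totals[group] for group in totals}


def demographic_parity_gap(rates):
    """Difference between the highest and lowest group selection rate."""
    return max(rates.values()) - min(rates.values())


# Illustrative audit: 4 people per group.
predictions = {"p1": 1, "p2": 1, "p3": 1, "p4": 0, "p5": 1, "p6": 0, "p7": 0, "p8": 0}
protected_lookup = {
    "p1": "A", "p2": "A", "p3": "A", "p4": "A",
    "p5": "B", "p6": "B", "p7": "B", "p8": "B",
}

rates = selection_rates_by_group(predictions, protected_lookup)
gap = demographic_parity_gap(rates)
print(rates, gap)
```

Because the lookup is only joined at audit time, it can be deleted on its own schedule without touching the training features.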

3. Training Data Minimisation: Practical Patterns

3.1 Feature Selection as a Compliance Tool

The most direct implementation of Art.5(1)(c) is treating feature selection as a compliance step, not just a modelling step.

Before training:

import pandas as pd

# Define which features are purpose-necessary
REQUIRED_FEATURES = {
    "loan_default_prediction": [
        "loan_amount", "loan_term_months", "income_bracket",
        "employment_type", "debt_to_income_ratio", "credit_score_band"
    ]
    # NOT included: postcode, age, marital_status, number_of_dependants
    # unless purpose explicitly requires them
}

def load_training_data(raw_df: pd.DataFrame, purpose: str) -> pd.DataFrame:
    # raw_df is the full source table, passed in explicitly rather than read from a global.
    # An undocumented purpose raises KeyError, so nothing is loaded without a stated purpose.
    required = REQUIRED_FEATURES[purpose]
    df = raw_df[required].copy()
    # Document what was excluded and why.
    # log_minimisation_decision is your own audit-log helper (not shown here).
    log_minimisation_decision(
        purpose=purpose,
        included=required,
        excluded=[c for c in raw_df.columns if c not in required],
        legal_basis="Article 5(1)(c) GDPR: limited to what is necessary for stated purpose"
    )
    return df

This audit log is directly relevant to DPIA documentation under GDPR Art.35 and EU AI Act technical documentation requirements.

3.2 Pseudonymisation Before Training

Where individual identity is not required for model training (which covers most supervised learning tasks), pseudonymise before the data enters the training pipeline.

import hashlib
import os
import pandas as pd

def pseudonymise_training_dataset(df: pd.DataFrame, pii_columns: list[str], salt: str) -> pd.DataFrame:
    """
    Replace PII identifiers with stable pseudonyms.
    Salt should be stored separately (not in training data or model artifacts).
    """
    df = df.copy()
    for col in pii_columns:
        df[col] = df[col].apply(
            lambda v: hashlib.sha256(f"{salt}{v}".encode()).hexdigest()[:16]
            if pd.notna(v) else None
        )
    return df

# Apply before any training split
df_train = pseudonymise_training_dataset(
    df=raw_training_data,
    pii_columns=["user_id", "email", "ip_address"],
    salt=os.environ["PSEUDONYM_SALT"]
)

Under GDPR Art.4(5) and Recital 26, pseudonymised data is still personal data, but keeping the additional information needed for re-identification (here, the salt) separate reduces the effective risk level.
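
A keyed hash is a common alternative to concatenating a salt with the value. The sketch below swaps in HMAC-SHA256 from the Python standard library. It is a variation on the approach above, not a replacement mandated by any guidance, and the function name and demo key are assumptions for illustration.

```python
import hashlib
import hmac


def keyed_pseudonym(value: str, key: bytes) -> str:
    """Stable pseudonym for `value` derived with HMAC-SHA256.

    The key plays the role of the "additional information" in Art.4(5):
    whoever holds it can re-link pseudonyms, so it must live outside the
    training data and model artifacts.
    """
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Demo key only. In practice the key would come from a secrets manager.
demo_key = b"demo-key-do-not-use-in-production"

first = keyed_pseudonym("anna@example.com", demo_key)
second = keyed_pseudonym("anna@example.com", demo_key)
other = keyed_pseudonym("anna@example.com", b"a-different-key")

print(first == second)  # same key and value give the same pseudonym
print(first == other)   # a different key gives an unlinkable pseudonym
```

Keeping the full digest rather than a truncated prefix lowers the chance that two different identifiers collide on the same pseudonym.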

3.3 Differential Privacy for Statistical Noise

Where the training dataset contains sensitive attributes that cannot be fully removed, differential privacy (DP) adds calibrated noise during training that provides mathematical privacy guarantees.

For PyTorch with Opacus (open source, EU-data-centre deployable):

import torch
from opacus import PrivacyEngine

model = MyModel()  # your own torch.nn.Module
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

privacy_engine = PrivacyEngine()
model, optimizer, data_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader,
    noise_multiplier=1.1,   # noise scale relative to max_grad_norm: higher means stronger privacy, lower accuracy
    max_grad_norm=1.0,      # per-sample gradient clipping bound
)

# ... run the training loop with the returned model, optimizer, and data_loader ...

# After training, document the privacy budget spent
delta = 1e-5
epsilon = privacy_engine.get_epsilon(delta=delta)  # returns epsilon for the chosen delta
print(f"Training DP budget: ε={epsilon:.2f}, δ={delta}")
# Record epsilon/delta alongside the Art.10 data governance records in the technical documentation

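Recording the DP budget in the technical file is a useful way to evidence that privacy protection was actively engineered into the training process. Neither GDPR nor the EU AI Act mandates differential privacy as such, so treat it as supporting evidence rather than a required field.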
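
To make the two parameters less abstract, here is a simplified, dependency-free illustration of the clip-then-add-noise step that DP-SGD is built on. It is a teaching sketch under the usual DP-SGD formulation, not a description of Opacus internals, and the function names are hypothetical.

```python
import math
import random


def clip_gradient(grad, max_grad_norm):
    """Scale a per-sample gradient so its L2 norm is at most max_grad_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_grad_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def private_mean_gradient(per_sample_grads, max_grad_norm, noise_multiplier, rng):
    """Clip each sample's gradient, sum, add Gaussian noise, then average."""
    clipped = [clip_gradient(g, max_grad_norm) for g in per_sample_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_grad_norm  # noise std is tied to the clipping bound
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [n / len(per_sample_grads) for n in noisy]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]  # L2 norms 5.0 and 0.5

clipped_first = clip_gradient(grads[0], max_grad_norm=1.0)
noisy_mean = private_mean_gradient(grads, max_grad_norm=1.0, noise_multiplier=1.1, rng=rng)
print(clipped_first, noisy_mean)
```

Clipping bounds how much any single record can move the model, and the noise hides whatever influence remains. That bounded influence is what the reported ε quantifies.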

3.4 Retention Limits for Training Datasets

Data minimisation applies not just to which data is included but to how long it is retained. GDPR Art.5(1)(e) (storage limitation) reinforces this.

Define explicit retention schedules for:

Raw collected data: Delete once training splits are generated
Training/validation/test splits: Delete or archive at end of model lifecycle
Experiment logs: Retain metadata, delete sample rows
Model checkpoints: Each checkpoint may embed training data statistics. Treat it as personal data if re-identification is reasonably likely, which is the test Recital 26 applies

import logging
import shutil

# Training pipeline cleanup hook
def cleanup_training_artifacts(run_id: str, retain_final_model: bool = True):
    paths = {
        "raw_data": f"data/raw/{run_id}/",
        "intermediate_splits": f"data/processed/{run_id}/",
        "checkpoints": f"models/{run_id}/checkpoints/",
    }
    for name, path in paths.items():
        if name == "checkpoints" and retain_final_model:
            # Keep only the final checkpoint, delete intermediates
            delete_all_except_latest(path)
        else:
            shutil.rmtree(path, ignore_errors=True)
            logging.info(f"Deleted {name} at {path} — retention limit reached")
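
A cleanup hook is only useful if something decides when it should run. The sketch below shows one way to derive deletion due dates from a retention schedule and list overdue artifacts. The retention periods, artifact names, and function names are hypothetical placeholders; the actual periods are a decision for you and your DPO.

```python
from datetime import date, timedelta

# Hypothetical retention periods in days, per artifact type.
RETENTION_DAYS = {
    "raw_data": 30,
    "intermediate_splits": 365,
    "experiment_samples": 90,
}


def deletion_due_date(artifact_type, created_on):
    """Date after which an artifact of this type should no longer exist."""
    return created_on + timedelta(days=RETENTION_DAYS[artifact_type])


def overdue_artifacts(inventory, today):
    """Names of artifacts whose retention period has expired.

    inventory: iterable of (name, artifact_type, created_on) tuples.
    """
    return [
        name for name, artifact_type, created_on in inventory
        if deletion_due_date(artifact_type, created_on) < today
    ]


inventory = [
    ("run-17/raw", "raw_data", date(2026, 4, 1)),
    ("run-17/splits", "intermediate_splits", date(2026, 1, 10)),
    ("run-12/samples", "experiment_samples", date(2026, 1, 5)),
]
overdue = overdue_artifacts(inventory, today=date(2026, 6, 6))
print(overdue)
```

Running a check like this on a schedule gives you a dated record showing that retention limits are enforced and not merely written down.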

4. Inference-Time Minimisation

Training data compliance is only half the equation. Production AI systems process personal data at inference time on every API call.

4.1 Input Scope: What Does the Model Actually Need?

Before building the input payload for a model call, apply the same "necessary for purpose" test:

import logging

audit_log = logging.getLogger("audit")

def build_model_input(user_context: dict, purpose: str) -> dict:
    """
    Build a minimised input payload — only fields the model needs for this purpose.
    """
    INFERENCE_FEATURE_MAP = {
        "sentiment_analysis": ["message_text"],
        "risk_scoring": ["transaction_amount", "merchant_category", "time_of_day"],
        "document_summary": ["document_text"],
    }

    allowed = INFERENCE_FEATURE_MAP.get(purpose, [])
    minimised = {k: v for k, v in user_context.items() if k in allowed}

    audit_log.info({
        "event": "inference_input_minimised",
        "purpose": purpose,
        "fields_included": list(minimised.keys()),
        "fields_excluded": [k for k in user_context if k not in allowed],
    })

    return minimised

4.2 LLM Context Windows: The Hidden PII Problem

For applications using large language models, the context window is the primary minimisation risk. Developers routinely pass entire user histories, full database records, or complete document stores as context — when only a fraction is relevant to the query.

The pattern to avoid:

# PROBLEMATIC: passes entire user profile to LLM
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": f"Customer: {json.dumps(full_customer_record)}\n\nQuestion: {user_query}"}
]

The pattern to adopt:

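# PREFERRED: retrieve-then-minimise before LLM call
# (assumes vector_store exposes a LangChain-style similarity_search;
#  is_purpose_relevant is your own relevance filter, not shown)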
def build_rag_context(user_query: str, user_id: str) -> str:
    # Step 1: Retrieve relevant chunks only
    relevant_chunks = vector_store.similarity_search(
        query=user_query,
        filter={"user_id": user_id},
        k=3  # top 3 relevant chunks, not full history
    )
    # Step 2: Exclude chunks with sensitive categories not relevant to query
    filtered_chunks = [c for c in relevant_chunks if is_purpose_relevant(c, user_query)]

    return "\n\n".join([c.page_content for c in filtered_chunks])

This pattern can also reduce hallucination risk, because the model sees less irrelevant context. Compliance and quality align here.
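
Retrieval narrows which chunks reach the model, but a relevant chunk can still carry direct identifiers the answer does not need. A minimal sketch of a redaction pass over retrieved text follows. The two regular expressions are deliberately crude assumptions that will miss names, addresses, and many phone formats, so treat this as a floor and not a complete PII detector.

```python
import re

# Rough patterns for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_identifiers(text):
    """Replace e-mail addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


chunk = "Contact anna@example.com or +49 30 1234567 about order 42."
redacted = redact_identifiers(chunk)
print(redacted)
```

Short numbers such as order references are left alone because the phone pattern requires at least nine characters between its first and last digit.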

4.3 Response Logging: Do Not Persist What You Did Not Need

LLM inference platforms often log full request/response pairs by default. This creates a secondary dataset of personal data that may be significantly broader than what the application purpose requires.

Implement selective logging:

from datetime import datetime, timezone

# MODEL_VERSION and metrics_store are assumed to be defined by your application.
def log_inference_result(request_id: str, input_fields: list[str], output: str):
    """Log only non-personal metadata about the inference, not the content."""
    log_entry = {
        "request_id": request_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "input_field_count": len(input_fields),
        "output_word_count": len(output.split()),  # whitespace word count, a rough proxy for tokens
        "model_version": MODEL_VERSION,
        # Do NOT log: input content, output content, user identifiers
    }
    metrics_store.write(log_entry)

Full content logging requires its own legal basis and retention schedule. It cannot simply be kept as a side effect of inference.
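
Where logging calls are scattered across a codebase, a logging filter can act as a backstop that strips content fields before anything is written. The sketch below uses the standard library `logging.Filter` hook and assumes structured log calls that pass a dictionary as the message. `SENSITIVE_KEYS`, the class name, and the logger name are illustrative assumptions.

```python
import logging

# Hypothetical field names that must never reach the log sink.
SENSITIVE_KEYS = {"input_text", "output_text", "user_id", "email"}


def scrub_log_entry(entry):
    """Return a copy of a structured log entry without content or identifier fields."""
    return {k: v for k, v in entry.items() if k not in SENSITIVE_KEYS}


class DropContentFilter(logging.Filter):
    """Scrub dictionary log messages in place and let every record through."""

    def filter(self, record):
        if isinstance(record.msg, dict):
            record.msg = scrub_log_entry(record.msg)
        return True


inference_log = logging.getLogger("inference_audit")
inference_log.addFilter(DropContentFilter())

# Even if a caller passes content by mistake, only metadata survives the filter.
inference_log.warning({"request_id": "r-1", "input_text": "free text", "latency_ms": 12})
```

The filter only handles dictionary messages. Free-form string logs would need a separate policy, such as banning them on this logger.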

5. DPIA Trigger: When AI + Minimisation = Mandatory Assessment

Under GDPR Art.35, a DPIA is required where processing is "likely to result in a high risk to the rights and freedoms of natural persons." The Article 29 Working Party guidelines on DPIAs (WP248), endorsed by the European Data Protection Board (EDPB), list nine criteria. As a rule of thumb, processing that meets two or more of them will in most cases require a DPIA:

1. Evaluation or scoring: AI systems that score individuals
2. Automated decision-making with legal or similar effect (Art.22)
3. Systematic monitoring: surveillance or tracking
4. Sensitive data processing: Art.9 categories
5. Data processed on a large scale
6. Matching or combining datasets: multiple sources merged
7. Data on vulnerable individuals: children, employees
8. Innovative use or technology: AI as novel technology
9. Processing that prevents data subjects from exercising a right or using a service or contract

High-risk AI systems under the EU AI Act will often meet at least criteria 1, 5, and 8, which in most cases makes a DPIA necessary. The minimisation documentation from your training pipeline becomes a DPIA input, not a separate exercise.
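
The two-criteria rule of thumb is easy to encode as a screening step in a project intake form. The sketch below is a hypothetical helper. The criterion identifiers are shorthand invented for this example, and the result is a prompt for a DPO review, not a legal determination.

```python
# Shorthand identifiers for the nine criteria, in the order listed above.
DPIA_CRITERIA = (
    "evaluation_or_scoring",
    "automated_decision_making",
    "systematic_monitoring",
    "sensitive_data",
    "large_scale",
    "matching_or_combining_datasets",
    "vulnerable_individuals",
    "innovative_technology",
    "prevents_exercise_of_rights",
)


def dpia_screening(criteria_met):
    """Count distinct criteria met and flag when the two-criteria rule of thumb applies."""
    met = set(criteria_met)
    unknown = met - set(DPIA_CRITERIA)
    if unknown:
        raise ValueError(f"Unknown criteria: {sorted(unknown)}")
    return {
        "criteria_met": sorted(met),
        "count": len(met),
        "dpia_likely_required": len(met) >= 2,
    }


# Typical profile for a scoring model trained on a large dataset.
result = dpia_screening(
    ["evaluation_or_scoring", "large_scale", "innovative_technology"]
)
print(result["count"], result["dpia_likely_required"])
```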

Connection to EU AI Act Art.10: the technical documentation required under Art.11 and Annex IV has to describe your Art.10 data governance measures, so it is the natural home for your minimisation log, feature selection rationale, and retention schedule. This means the DPIA and the AI Act technical file can be co-authored.

6. Special Category Data in Training Datasets

When your AI training dataset includes special category data (health, race, religion, biometric, genetic, sexual orientation data under GDPR Art.9), you need two independent legal bases:

GDPR Art.9(2): explicit consent, substantial public interest (Art.9(2)(g)), research purposes (Art.9(2)(j) with Art.89 safeguards), or another listed ground. Legitimate interest is not an Art.9(2) ground
EU AI Act Art.10(5): bias detection and correction purpose only

This is the narrowest possible scope. If you are including racial or ethnic data "to improve model accuracy generally," that does not meet Art.10(5). The purpose must be explicit bias detection and correction.

Documentation in your technical file should include:

Which Art.9 ground was invoked
What bias was being detected/corrected
How the special category data was isolated (not merged into general training features)
When the special category data will be deleted
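
These four items fit naturally into a small structured record that can be serialised into the technical file. The sketch below is one possible shape, with hypothetical field names and example values. It does not reflect any template prescribed by the Regulation.

```python
from dataclasses import asdict, dataclass
from datetime import date


@dataclass(frozen=True)
class SpecialCategoryRecord:
    """One entry per special category attribute held for bias work."""
    data_category: str
    gdpr_art9_ground: str
    bias_under_test: str
    isolation_measure: str
    delete_by: date

    def is_overdue(self, today: date) -> bool:
        """True once the documented deletion date has passed."""
        return today > self.delete_by


# Example values are illustrative only.
entry = SpecialCategoryRecord(
    data_category="ethnic origin",
    gdpr_art9_ground="Art.9(2)(a) explicit consent",
    bias_under_test="approval-rate gap between groups",
    isolation_measure="separate audit table, never joined into training features",
    delete_by=date(2026, 12, 31),
)

print(asdict(entry)["gdpr_art9_ground"], entry.is_overdue(date(2027, 1, 1)))
```

Making the record immutable (`frozen=True`) means a changed justification has to be logged as a new entry and cannot silently overwrite the old one.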

7. Developer Checklist: Data Minimisation Compliance

Training Phase

Feature inventory: Document every column in the training dataset and its purpose-necessity rationale
Exclusion log: Record which features were considered and excluded, with reasoning (Art.5(1)(c) justification)
Pseudonymisation applied: PII identifiers replaced before training splits are generated
Special category audit: Art.9 categories identified; Art.10(5) / Art.9(2) basis documented if present
Retention schedule: Raw data, processed splits, and checkpoints have documented deletion dates
DP budget (if applicable): Differential privacy ε/δ values recorded in technical file

Inference Phase

Input scope mapping: For each API endpoint, documented list of required input fields
Context minimisation (for LLM applications): RAG retrieval retrieves relevant chunks only, not full histories
Response logging scope: Logs capture metadata, not personal content, unless separate legal basis exists
Request-level audit trail: Each inference logs which input fields were processed (for DPA audits)

Documentation

DPIA triggered or waived: If the system is high-risk AI under Annex III, assume a DPIA is required and include minimisation evidence; document the reasoning if you conclude otherwise
EU AI Act Art.10 technical file section: Feature selection rationale, dataset suitability documentation
Retention policy published: Internal policy document covering all dataset lifecycle stages
Review cadence: Annual review scheduled, plus an ad hoc review whenever the model purpose changes

8. August 2026: What Happens Next

The EU AI Act obligations for providers of Annex III high-risk AI systems under Chapter III apply from August 2, 2026. National Competent Authorities (NCAs) gain powers to request technical documentation at that point.

For Art.10, NCAs are specifically empowered to review:

The data governance practices used to prepare training datasets
Whether data quality criteria were met
Whether special category data was handled under Art.10(5) conditions

Developers who can demonstrate a documented minimisation process — feature selection logs, pseudonymisation steps, retention schedules — will be substantially better positioned for NCA inspection than those who relied on "we only used what we needed" as an undocumented assertion.

The DPIA documentation from GDPR compliance and the data governance part of the EU AI Act technical file draw on largely the same evidence. Build one artifact that can feed both.

Conclusion

Data minimisation under GDPR Art.5(1)(c) is not an obstacle to building AI systems; it is a design constraint that can improve them. Feature-selective training and differential privacy can reduce overfitting risk, and together with pseudonymisation and scoped inference inputs they make systems more defensible under audit.

The EU AI Act Art.10 requirements do not conflict with GDPR minimisation — they reinforce it, adding representativeness and data governance documentation requirements that give the minimisation principle operational teeth.

With August 2026 approaching, the time to build these practices into your training pipelines is now: retrofitting minimisation after a model is in production is significantly harder than designing for it from the beginning.

Related posts in this series:

Post #1: Art.22 Automated Decision-Making + EU AI Act High-Risk Classification
Post #2: DPIA for High-Risk AI — Art.35 GDPR + Art.9 EU AI Act Risk Management
Post #3: Art.17 Right to Erasure: LLM Training Data Removal & RAG Vector Store Deletion
Post #5 (coming next): AI + GDPR Full Compliance Stack: DPO Role, Accountability, Audit Trail Finale
