2026-06-06·5 min read·sota.io Team

Data Minimisation in AI: GDPR Art.5(1)(c) + EU AI Act Art.10 Developer Compliance Guide 2026

Post #4 in the sota.io EU AI Act + GDPR Intersection Series

AI systems are hungry for data. The more training data, the better the model — or so the conventional wisdom goes. But GDPR's data minimisation principle sits in direct tension with that instinct: personal data must be "adequate, relevant and limited to what is necessary" for its purpose.

This tension is now legally consequential. With the EU AI Act's prohibited-practice rules in force since February 2025 and the August 2, 2026 deadline for high-risk AI obligations approaching, developers need to understand where GDPR Art.5(1)(c) intersects with EU AI Act Art.10 data governance requirements, and what that means for training datasets, inference pipelines, and production AI systems.


1. GDPR Art.5(1)(c): Data Minimisation Explained for AI

Article 5(1)(c) of the GDPR states that personal data shall be:

"adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed"

This is the data minimisation principle. It applies from the moment you collect data through to deletion. For AI developers, there are three distinct processing contexts where it applies:

| Context | Minimisation Obligation |
| --- | --- |
| Training data collection | Only collect personal data attributes actually needed to train the model for its stated purpose |
| Training data processing | Strip, anonymise, or pseudonymise attributes not required for the model objective |
| Inference (production) | Do not process more user data per request than required to produce the output |

Data minimisation cannot be evaluated in isolation — it is locked to the processing purpose established under Art.5(1)(b). Before you can assess whether data is "necessary," you must have a clearly documented purpose.

For AI systems, this means:

A common violation pattern: developers ingest entire database snapshots for convenience, train on columns they didn't plan to use, and never prune the training dataset. Under GDPR, this is an Art.5(1)(c) breach even if the final model doesn't "expose" the excess data directly.

Fines Under Art.83(5)

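Violations of Art.5 principles, including data minimisation, fall under Article 83(5) of the GDPR, the higher fine tier: up to €20 million or, for an undertaking, up to 4% of total worldwide annual turnover for the preceding financial year, whichever is higher.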

The CNIL's 2023 €380,000 fine against Doctissimo (French health portal) cited, among other breaches, excessive data retention under the Art.5(1)(e) storage limitation principle. DPAs are beginning to apply the same scrutiny to AI training datasets.


2. EU AI Act Art.10: Data Governance Requirements

Article 10 of the EU AI Act applies to high-risk AI systems (as defined in Annex III, operative from August 2026). It establishes data governance obligations for training, validation, and testing datasets.

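Art.10(2) requires that data governance practices address design choices, data collection processes and the origin of the data, data preparation (annotation, labelling, cleaning), the assumptions the data is meant to capture, the availability and suitability of the datasets, examination and mitigation of possible biases, and identification of data gaps.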

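Art.10(3) requires that training, validation, and testing datasets be relevant, sufficiently representative, and, to the best extent possible, free of errors and complete in view of the intended purpose.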

Art.10(5) addresses special categories of personal data (such as health, biometric, or racial or ethnic origin data): providers may process these only where strictly necessary for detecting and correcting bias, and subject to appropriate safeguards.

Where Art.10 and Art.5(1)(c) Align

Both requirements share a common core: fitness for purpose. Art.10(3)'s "relevant" and "sufficient" framing maps directly onto Art.5(1)(c)'s "adequate" and "relevant."

The practical effect: a dataset that passes GDPR minimisation review will generally also satisfy the relevance test under Art.10. The failure case carries over too: an overly broad dataset that fails the minimisation test will struggle to show that every attribute is relevant to the AI system's intended purpose.

Where They Diverge

The EU AI Act introduces a representativeness requirement that GDPR does not. Art.10(3) requires datasets to cover relevant subgroups; too little data about a protected characteristic can create biased models — but that very characteristic may also be data you should minimise under GDPR.

This is the core tension developers must navigate:

| Concern | GDPR Direction | EU AI Act Direction |
| --- | --- | --- |
| Gender data in training set | Minimise unless strictly necessary | Include for fairness/bias detection |
| Ethnicity/race data | Sensitive category: restrict to Art.9 basis | May be needed for representativeness |
| Age data | Limit to what the model actually uses | Required for age-sensitive applications |

Resolution: The EU AI Act Art.10(5) carve-out is deliberately narrow. Special category data is allowed only for bias detection and correction, not for general training. Document this basis explicitly and treat it as a data protection impact assessment (DPIA) trigger.
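
One way to reconcile the two directions is to hold the protected attribute in a separate audit table, use it only to compare error rates across groups, and never pass it to the model as a feature. The sketch below assumes that setup; the record layout, group labels, and function name are hypothetical.

```python
from collections import defaultdict


def error_rate_by_group(predictions, labels, audit_groups):
    """Return {group: error_rate} using a protected attribute held outside the training set.

    predictions, labels: dicts keyed by pseudonymous record id
    audit_groups: dict mapping the same ids to a protected-attribute value,
                  stored separately and used only for this bias check
    """
    errors = defaultdict(int)
    totals = defaultdict(int)
    for record_id, group in audit_groups.items():
        if record_id not in predictions:
            continue  # the audit table may cover more records than this evaluation run
        totals[group] += 1
        if predictions[record_id] != labels[record_id]:
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}


predictions = {"r1": 1, "r2": 0, "r3": 1, "r4": 0}
labels = {"r1": 1, "r2": 1, "r3": 1, "r4": 0}
audit_groups = {"r1": "A", "r2": "A", "r3": "B", "r4": "B"}

rates = error_rate_by_group(predictions, labels, audit_groups)
print(rates)
```

A large gap between groups is the signal that justifies keeping the audit table; if no bias check is planned, the attribute has no purpose and should not be retained.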


3. Training Data Minimisation: Practical Patterns

3.1 Feature Selection as a Compliance Tool

The most direct implementation of Art.5(1)(c) is treating feature selection as a compliance step, not just a modelling step.

Before training:

import pandas as pd

# Define which features are purpose-necessary
REQUIRED_FEATURES = {
    "loan_default_prediction": [
        "loan_amount", "loan_term_months", "income_bracket",
        "employment_type", "debt_to_income_ratio", "credit_score_band"
    ]
    # NOT included: postcode, age, marital_status, number_of_dependants
    # unless purpose explicitly requires them
}

def load_training_data(raw_df: pd.DataFrame, purpose: str) -> pd.DataFrame:
    required = REQUIRED_FEATURES[purpose]  # KeyError for an undeclared purpose
    df = raw_df[required].copy()
    # Document what was excluded and why
    # (log_minimisation_decision is a project-specific helper, not shown here)
    log_minimisation_decision(
        purpose=purpose,
        included=required,
        excluded=[c for c in raw_df.columns if c not in required],
        legal_basis="Article 5(1)(c) GDPR: limited to what is necessary for stated purpose"
    )
    return df

This audit log is directly relevant to DPIA documentation under GDPR Art.35 and EU AI Act technical documentation requirements.
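
`log_minimisation_decision` is not defined in the snippet above. A minimal sketch of what such a helper could look like follows, assuming an append-only JSON Lines file is an acceptable audit store; the file path and record fields are illustrative choices, not a prescribed format.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

AUDIT_LOG_PATH = Path("minimisation_audit.jsonl")  # hypothetical location


def log_minimisation_decision(purpose, included, excluded, legal_basis,
                              path=AUDIT_LOG_PATH):
    """Append one audit record describing which columns were kept and dropped."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "purpose": purpose,
        "included": sorted(included),
        "excluded": sorted(excluded),
        "legal_basis": legal_basis,
    }
    # Append mode keeps earlier decisions intact, one JSON object per line
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record
```

The log stores column names only, so it can be kept longer than the training data it describes.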

3.2 Pseudonymisation Before Training

Where individual identity is not required for model training (which covers most supervised learning tasks), pseudonymise before the data enters the training pipeline.

import hashlib
import os
import pandas as pd

def pseudonymise_training_dataset(df: pd.DataFrame, pii_columns: list[str], salt: str) -> pd.DataFrame:
    """
    Replace PII identifiers with stable pseudonyms.
    Salt should be stored separately (not in training data or model artifacts).
    """
    df = df.copy()
    for col in pii_columns:
        df[col] = df[col].apply(
            lambda v: hashlib.sha256(f"{salt}{v}".encode()).hexdigest()[:16]
            if pd.notna(v) else None
        )
    return df

# Apply before any training split
df_train = pseudonymise_training_dataset(
    df=raw_training_data,
    pii_columns=["user_id", "email", "ip_address"],
    salt=os.environ["PSEUDONYM_SALT"]
)

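Under GDPR Art.4(5) and Recital 26, pseudonymised data is still personal data, because it can be attributed to a person using additional information. Keeping that additional information (here, the salt) separate and protected lowers the risk, but it does not take the dataset out of GDPR scope.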
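
A keyed hash is a common alternative to concatenating a salt with the value. The sketch below swaps in HMAC-SHA256 from the Python standard library; it is a variation on the approach above rather than the original method, and the key value and function name are placeholders.

```python
import hashlib
import hmac


def pseudonymise_value(value: str, key: bytes, length: int = 16) -> str:
    """Return a stable pseudonym for value using HMAC-SHA256 with a secret key."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]  # truncation shortens the token but raises collision odds


key = b"example-key-load-from-a-secret-store"  # placeholder, never hard-code a real key
a = pseudonymise_value("alice@example.com", key)
b = pseudonymise_value("alice@example.com", key)
c = pseudonymise_value("alice@example.com", b"a-different-key")
print(a == b, a == c)
```

Because the same key always yields the same pseudonym, records stay joinable across tables; rotating or destroying the key breaks that link.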

3.3 Differential Privacy for Statistical Noise

Where the training dataset contains sensitive attributes that cannot be fully removed, differential privacy (DP) adds calibrated noise during training that provides mathematical privacy guarantees.

For PyTorch with Opacus (open source, EU-data-centre deployable):

import torch
from opacus import PrivacyEngine

model = MyModel()  # your own torch.nn.Module
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

privacy_engine = PrivacyEngine()
model, optimizer, data_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader,  # your existing torch DataLoader
    noise_multiplier=1.1,   # more noise gives a smaller ε (stronger privacy); tune to your risk appetite
    max_grad_norm=1.0,      # per-sample gradient clipping bound
)

# ... train with the returned model, optimizer and data_loader ...

# After training, document the privacy budget spent
delta = 1e-5
epsilon = privacy_engine.get_epsilon(delta=delta)  # Opacus 1.x returns ε for the chosen δ
print(f"Training DP budget: ε={epsilon:.2f}, δ={delta}")
# Record epsilon/delta in your technical documentation

Recording the DP budget (ε, δ) in the EU AI Act technical file is a concrete way to show that privacy protection was engineered into the training process rather than asserted after the fact.
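
Conceptually, DP-SGD clips each example's gradient to a maximum L2 norm and then adds Gaussian noise with standard deviation `noise_multiplier * max_grad_norm` to the summed gradient. The sketch below illustrates that single step in plain Python on toy vectors; it is a teaching aid under those assumptions, not a description of Opacus internals, and all names are illustrative.

```python
import math
import random


def clip_gradient(grad, max_grad_norm):
    """Scale grad down so its L2 norm is at most max_grad_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_grad_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_aggregate(per_sample_grads, max_grad_norm, noise_multiplier, rng):
    """Clip each per-sample gradient, sum them, add Gaussian noise, then average."""
    clipped = [clip_gradient(g, max_grad_norm) for g in per_sample_grads]
    dim = len(per_sample_grads[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_grad_norm  # noise scale tied to the clipping bound
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [n / len(per_sample_grads) for n in noisy]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]  # L2 norms 5.0 and 0.5
print(clip_gradient(grads[0], 1.0))
print(dp_sgd_aggregate(grads, max_grad_norm=1.0, noise_multiplier=1.1, rng=rng))
```

Clipping bounds how much any single record can move the model, and the noise hides what remains; together they are what the reported ε and δ quantify.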

3.4 Retention Limits for Training Datasets

Data minimisation applies not just to which data is included but to how long it is retained. GDPR Art.5(1)(e) (storage limitation) reinforces this.

Define explicit retention schedules for raw data snapshots, intermediate processed splits, and model checkpoints:

import logging
import shutil

# Training pipeline cleanup hook
def cleanup_training_artifacts(run_id: str, retain_final_model: bool = True):
    paths = {
        "raw_data": f"data/raw/{run_id}/",
        "intermediate_splits": f"data/processed/{run_id}/",
        "checkpoints": f"models/{run_id}/checkpoints/",
    }
    for name, path in paths.items():
        if name == "checkpoints" and retain_final_model:
            # Keep only the final checkpoint, delete intermediates
            # (delete_all_except_latest is a project-specific helper, not shown here)
            delete_all_except_latest(path)
        else:
            shutil.rmtree(path, ignore_errors=True)
            logging.info("Deleted %s at %s (retention limit reached)", name, path)
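
The cleanup hook above deletes whenever it is called; a retention schedule decides when to call it. The sketch below assumes per-artifact retention periods in days and a recorded creation time. The periods shown are illustrative values, not recommendations.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention periods per artifact type, in days
RETENTION_DAYS = {
    "raw_data": 30,
    "intermediate_splits": 14,
    "checkpoints": 90,
}


def is_past_retention(artifact_type: str, created_at: datetime, now: datetime) -> bool:
    """Return True once an artifact has outlived its retention period."""
    limit = timedelta(days=RETENTION_DAYS[artifact_type])
    return now - created_at > limit


created = datetime(2026, 1, 1, tzinfo=timezone.utc)
now = datetime(2026, 2, 15, tzinfo=timezone.utc)  # 45 days after creation
expired = [name for name in RETENTION_DAYS if is_past_retention(name, created, now)]
print(expired)
```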

4. Inference-Time Minimisation

Training data compliance is only half the equation. Production AI systems process personal data at inference time on every API call.

4.1 Input Scope: What Does the Model Actually Need?

Before building the input payload for a model call, apply the same "necessary for purpose" test:

import logging

audit_log = logging.getLogger("inference.audit")

# Fields each purpose is allowed to send to the model
INFERENCE_FEATURE_MAP = {
    "sentiment_analysis": ["message_text"],
    "risk_scoring": ["transaction_amount", "merchant_category", "time_of_day"],
    "document_summary": ["document_text"],
}

def build_model_input(user_context: dict, purpose: str) -> dict:
    """
    Build a minimised input payload: only fields the model needs for this purpose.
    """
    # An unregistered purpose yields an empty payload
    allowed = INFERENCE_FEATURE_MAP.get(purpose, [])
    minimised = {k: v for k, v in user_context.items() if k in allowed}

    # Log field names only, never the values
    audit_log.info({
        "event": "inference_input_minimised",
        "purpose": purpose,
        "fields_included": list(minimised.keys()),
        "fields_excluded": [k for k in user_context if k not in allowed],
    })

    return minimised
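
The function above returns an empty payload when the purpose is not registered, which is safe but silent. A stricter variant, sketched below under the assumption that an unregistered purpose should be treated as a programming error, raises instead. The exception class and function name are hypothetical.

```python
ALLOWED_FIELDS = {
    "sentiment_analysis": {"message_text"},
    "risk_scoring": {"transaction_amount", "merchant_category", "time_of_day"},
}


class UnknownPurposeError(ValueError):
    """Raised when a caller asks for a purpose with no registered field list."""


def minimise_strict(user_context: dict, purpose: str) -> dict:
    """Keep only registered fields; refuse to run for unregistered purposes."""
    if purpose not in ALLOWED_FIELDS:
        raise UnknownPurposeError(f"No field allow-list registered for {purpose!r}")
    allowed = ALLOWED_FIELDS[purpose]
    return {k: v for k, v in user_context.items() if k in allowed}


context = {"message_text": "late delivery", "email": "a@example.com", "age": 41}
print(minimise_strict(context, "sentiment_analysis"))
```

Raising forces every new use case to declare its fields before it can reach the model, which keeps the allow-list in step with the documented purposes.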

4.2 LLM Context Windows: The Hidden PII Problem

For applications using large language models, the context window is the primary minimisation risk. Developers routinely pass entire user histories, full database records, or complete document stores as context — when only a fraction is relevant to the query.

The pattern to avoid:

# PROBLEMATIC: passes entire user profile to LLM
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": f"Customer: {json.dumps(full_customer_record)}\n\nQuestion: {user_query}"}
]

The pattern to adopt:

# PREFERRED: retrieve-then-minimise before LLM call
# (vector_store and is_purpose_relevant are application-specific and not shown here)
def build_rag_context(user_query: str, user_id: str) -> str:
    # Step 1: Retrieve relevant chunks only
    relevant_chunks = vector_store.similarity_search(
        query=user_query,
        filter={"user_id": user_id},
        k=3  # top 3 relevant chunks, not full history
    )
    # Step 2: Exclude chunks with sensitive categories not relevant to query
    filtered_chunks = [c for c in relevant_chunks if is_purpose_relevant(c, user_query)]

    return "\n\n".join([c.page_content for c in filtered_chunks])
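
`is_purpose_relevant` is left undefined above. One possible shape is sketched below, assuming each chunk carries metadata tags for the sensitive categories it contains and that the query has already been mapped to a set of permitted categories. The `Chunk` class, tag names, and signature are hypothetical and differ from the two-argument call in the snippet.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    page_content: str
    sensitive_tags: set = field(default_factory=set)  # e.g. {"health"}


def is_purpose_relevant(chunk: Chunk, permitted_tags: set) -> bool:
    """Keep a chunk only if every sensitive category it carries is permitted."""
    return chunk.sensitive_tags <= permitted_tags


chunks = [
    Chunk("Order #1 shipped on Monday."),
    Chunk("Customer mentioned a medical condition.", {"health"}),
]
# A shipping question permits no sensitive categories at all
kept = [c.page_content for c in chunks if is_purpose_relevant(c, permitted_tags=set())]
print(kept)
```

Tagging at ingestion time keeps the inference-time check cheap and makes the filtering decision auditable.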

This pattern can also reduce hallucination risk, because the model sees less irrelevant context. Compliance and output quality align here.

4.3 Response Logging: Do Not Persist What You Did Not Need

LLM inference platforms often log full request/response pairs by default. This creates a secondary dataset of personal data that may be significantly broader than what the application purpose requires.

Implement selective logging:

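from datetime import datetime, timezone

def log_inference_result(request_id: str, input_fields: list[str], output: str):
    """Log only non-personal metadata about the inference, not the content."""
    log_entry = {
        "request_id": request_id,  # use a random ID that cannot be linked back to the user
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "input_field_count": len(input_fields),
        # Approximation: whitespace-separated words, not model tokens
        "output_token_count": len(output.split()),
        "model_version": MODEL_VERSION,  # defined elsewhere in the application
        # Do NOT log: input content, output content, user identifiers
    }
    metrics_store.write(log_entry)  # metrics_store is an application-specific sink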

Full content logging needs its own legal basis and retention schedule. It should be a deliberate decision, not a default side effect of inference.


5. DPIA Trigger: When AI + Minimisation = Mandatory Assessment

Under GDPR Art.35, a DPIA is required where processing is "likely to result in a high risk to the rights and freedoms of natural persons." The DPIA guidelines endorsed by the European Data Protection Board (EDPB) list nine criteria; as a rule of thumb, processing that meets two or more of them will in most cases require a DPIA:

  1. Evaluation or scoring — AI systems that score individuals
  2. Automated decision-making with legal or similar effect (Art.22)
  3. Systematic monitoring — surveillance or tracking
  4. Sensitive data processing — Art.9 categories
  5. Data processed on a large scale
  6. Matching or combining datasets — multiple sources merged
  7. Data on vulnerable individuals — children, employees
  8. Innovative use or technology — AI as novel technology
  9. Processing that prevents data subjects from exercising a right or using a service or contract

High-risk AI systems under the EU AI Act will often meet at least criteria 1, 5, and 8, in which case a DPIA will normally be required. The minimisation documentation from your training pipeline becomes a DPIA input, not a separate exercise.
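
The two-criteria threshold can be turned into a simple screening helper. The sketch below assumes the criteria are tracked by the numbers used in the list above; it only flags when a DPIA is indicated and is no substitute for the assessment itself. The function name and default threshold parameter are illustrative.

```python
VALID_CRITERIA = set(range(1, 10))  # criteria 1 to 9 from the list above


def dpia_indicated(criteria_met: set, threshold: int = 2) -> bool:
    """Return True when at least `threshold` of the nine criteria apply."""
    unknown = set(criteria_met) - VALID_CRITERIA
    if unknown:
        raise ValueError(f"Unknown criteria: {sorted(unknown)}")
    return len(set(criteria_met)) >= threshold


# A credit-scoring model: scoring (1), large scale (5), innovative technology (8)
print(dpia_indicated({1, 5, 8}))
# An internal prototype that only meets the "innovative technology" criterion
print(dpia_indicated({8}))
```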

Connection to the EU AI Act: the technical documentation required under Art.11 and Annex IV covers the data governance measures of Art.10, so it should include your minimisation log, feature selection rationale, and retention schedule. This means the DPIA and the AI Act technical file can be co-authored.


6. Special Category Data in AI Training (Art.10(5) + GDPR Art.9)

When your AI training dataset includes special category data (health, racial or ethnic origin, religion, biometric, genetic, or sexual orientation data under GDPR Art.9), you need to satisfy two independent conditions:

  1. GDPR Art.9(2): explicit consent, substantial public interest (Art.9(2)(g)), research purposes (Art.9(2)(j) with Art.89 safeguards), or another listed ground. Legitimate interest is an Art.6 basis and does not by itself lift the Art.9 prohibition.
  2. EU AI Act Art.10(5): bias detection and correction purpose only

This scope is deliberately narrow. If you are including racial or ethnic data "to improve model accuracy generally," that does not meet Art.10(5). The purpose must be explicit bias detection and correction.
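
A pipeline can enforce this narrow scope mechanically by refusing to release special category columns unless the declared purpose is bias detection or correction. The sketch below assumes a hypothetical column naming scheme and purpose vocabulary; it illustrates the gate, not a complete access-control system.

```python
SPECIAL_CATEGORY_COLUMNS = {"ethnicity", "health_condition", "religion"}  # illustrative
BIAS_PURPOSES = {"bias_detection", "bias_correction"}


def select_columns(requested: list, purpose: str) -> list:
    """Drop special category columns unless the purpose is bias detection/correction."""
    if purpose in BIAS_PURPOSES:
        return list(requested)
    return [c for c in requested if c not in SPECIAL_CATEGORY_COLUMNS]


requested = ["loan_amount", "ethnicity", "income_bracket"]
print(select_columns(requested, "general_training"))
print(select_columns(requested, "bias_detection"))
```

Passing the gate covers only the purpose condition; the GDPR Art.9(2) ground still has to be established and documented separately.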

Documentation in your technical file should include:


7. Developer Checklist: Data Minimisation Compliance

Training Phase

Inference Phase

Documentation


8. August 2026: What Happens Next

The EU AI Act obligations for high-risk AI providers under Chapter III become fully enforceable on August 2, 2026. National Competent Authorities (NCAs) gain powers to request technical documentation at that point.

For Art.10, NCAs are specifically empowered to review:

Developers who can demonstrate a documented minimisation process — feature selection logs, pseudonymisation steps, retention schedules — will be substantially better positioned for NCA inspection than those who relied on "we only used what we needed" as an undocumented assertion.

The DPIA documentation from GDPR compliance and the data governance part of the EU AI Act technical file overlap heavily. Where possible, build one shared artifact that feeds both.


Conclusion

Data minimisation under GDPR Art.5(1)(c) is not an obstacle to building AI systems; it is a design constraint that can improve them. Feature-selective training, pseudonymisation, differential privacy, and scoped inference inputs make systems more defensible under audit, and trimming irrelevant features can also reduce overfitting risk.

The EU AI Act Art.10 requirements largely reinforce GDPR minimisation rather than conflict with it. They add representativeness and data governance documentation requirements that give the minimisation principle operational teeth, with the narrow bias-detection carve-out for special category data as the main point of tension.

With August 2026 approaching, the time to build these practices into your training pipelines is now: retrofitting minimisation after a model is in production is significantly harder than designing for it from the beginning.

