GDPR and AI Fine-Tuning: Can You Use Your SaaS Users' Data to Train LLMs?
Your SaaS product generates exactly the domain-specific data that makes an LLM genuinely useful: support tickets, code snippets, legal drafts, medical notes, customer messages. Fine-tuning a model on this data could dramatically improve product quality. But does using it for training comply with GDPR?
The answer depends on three questions that most privacy policies sidestep entirely: Was training disclosed as a purpose when data was originally collected? Does any lawful basis actually apply to training as a secondary purpose? And if a user exercises their right to erasure, can you actually remove their data from model weights?
This guide works through each question with the precision that GDPR supervisory authorities expect — not the privacy-policy minimalism that gets companies fined.
The Core GDPR Constraint: Purpose Limitation
GDPR Article 5(1)(b) requires that personal data be "collected for specified, explicit and legitimate purposes and not further processed in a manner that is incompatible with those purposes."
This single provision is the source of most AI training compliance problems. When a user signs up for your SaaS product and submits data — a support request, a generated document, a transaction record — they typically do so for an operational purpose: resolving an issue, creating a document, executing a payment. They did not submit data to improve a machine learning model.
The question is whether using that data for AI training is compatible with the original collection purpose under Recital 50 of the GDPR, or whether it constitutes incompatible secondary processing that requires a fresh lawful basis.
The Compatibility Test (Art. 6(4) and Recital 50)
Before you reach the Article 6 lawful-basis question, you must pass the compatibility test. Article 6(4) sets out the factors to assess:
- Link between purposes — Is there a clear connection between "customer support" (original purpose) and "model training" (new purpose)?
- Context of collection — What were the reasonable expectations of the data subject when they submitted data? Did they expect it to be used for training?
- Nature of the data — Is it ordinary personal data, special category data (Art. 9), or data relating to criminal matters (Art. 10)?
- Consequences — What are the possible consequences of the further processing for the data subject?
- Safeguards — Are appropriate safeguards in place (encryption, pseudonymisation)?
For most SaaS training scenarios, the second and fourth factors (context and consequences) create significant friction. Users do not typically expect operational communications to train AI models. The consequences — their communication patterns, writing style, and business logic becoming embedded in a model — are non-obvious and potentially significant.
EDPB Opinion 28/2024 on AI models (adopted December 2024) confirmed that purpose limitation applies to AI model training and that compatibility cannot be assumed simply because the controller finds training "useful."
The Available Lawful Bases for AI Training
If compatibility analysis does not resolve the question (and for training, it usually does not), you need a standalone Article 6 lawful basis for the training activity.
Option 1: Consent (Art. 6(1)(a))
Consent is the cleanest basis for AI training — when properly obtained.
What valid consent requires:
- Freely given, specific, informed, and unambiguous indication of agreement
- Separate, granular consent for the training purpose (not bundled with service terms)
- The ability to withdraw consent without detriment to service access
- Clear information about: what data is used, which models are trained, how long the data is retained for training purposes, and what happens to the model if consent is withdrawn
The withdrawal problem: GDPR Article 7(3) requires that withdrawal of consent be "as easy as giving" it. For AI training, this creates a technical problem: a user who withdraws consent needs their training data excluded not only from future training runs but, arguably, from models already trained. More on this below.
The free service trap: If your product is free and consent to training is a condition of access, consent may not be "freely given" under Article 7(4). A user who cannot use the service without agreeing to training has no genuine choice. The EDPB's Guidelines 05/2020 on consent make this explicit.
Practical implementation:
Training Consent Disclosure:
"We would like to use your [support messages / documents / queries]
to improve our AI features. This is separate from providing you
the service. You can say no — it won't affect your account.
[ ] Yes, use my data to train AI models
[ ] No thanks
You can change this any time in Settings > Privacy > AI Training."
Option 2: Legitimate Interest (Art. 6(1)(f))
Legitimate interest (LI) is the basis most companies reach for when they do not want to ask for consent. But LI requires a three-part balancing test that DPAs examine closely.
The three-step legitimate interest assessment (LIA):
Step 1 — Identify the legitimate interest: What is your genuine interest? "Improving our AI models" is likely legitimate in principle — it is a real business interest with downstream user benefit. This step rarely fails.
Step 2 — Necessity: Is AI training on user data necessary to achieve this interest? Could you achieve the same result with synthetic data, publicly licensed data, or anonymised data? If yes, the necessity test fails. This is where many LI claims collapse — if there are less privacy-invasive alternatives that achieve the same goal, LI cannot be the lawful basis.
Step 3 — Balancing: Do the data subject's interests or fundamental rights override your legitimate interest? The relevant factors include:
- Whether the data subject would reasonably expect training
- The sensitivity of the data
- Whether the training creates a power imbalance
- What safeguards you apply (pseudonymisation, aggregation, access controls)
For consumer-facing SaaS with general audiences, balancing often tips against the controller when: (a) users submitted professional or sensitive content, (b) training was not disclosed in clear terms at the point of data collection, or (c) the data reveals patterns users did not intend to share (communication style, response times, business logic).
For B2B SaaS with sophisticated business customers who have negotiated DPAs and where training is disclosed in the contract, the balancing may tip differently.
EDPB Guidance: The EDPB's Opinion 28/2024 on AI models noted that legitimate interest cannot be used as a "blanket basis" for all AI training and requires genuine balancing — not a pro forma assessment.
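If you rely on legitimate interest, Article 5(2) accountability means the assessment itself must be documented and kept under review. Below is a minimal sketch of how an LIA could be recorded alongside the training pipeline; the class and field names are illustrative, not anything the GDPR prescribes.

# Illustrative record of a legitimate interest assessment (LIA),
# kept alongside the training pipeline for accountability (Art. 5(2)).
from dataclasses import dataclass, field
from datetime import date

@dataclass
class LegitimateInterestAssessment:
    purpose: str                  # e.g. "fine-tune support-ticket summarisation model"
    interest: str                 # step 1: the interest pursued
    necessity_rationale: str      # step 2: why less invasive data would not suffice
    balancing_factors: list[str] = field(default_factory=list)  # step 3 considerations
    safeguards: list[str] = field(default_factory=list)         # pseudonymisation, access controls, ...
    outcome: str = "not assessed"                # "proceed", "proceed with safeguards", "do not proceed"
    assessed_on: date = field(default_factory=date.today)
    review_due: date | None = None

# Hypothetical example values:
lia = LegitimateInterestAssessment(
    purpose="fine-tune ticket triage model",
    interest="improve first-response accuracy for all customers",
    necessity_rationale="synthetic tickets tested; accuracy insufficient for domain jargon",
    balancing_factors=["B2B users, training disclosed in DPA", "no special category data"],
    safeguards=["pseudonymisation before export", "90-day retention window"],
    outcome="proceed with safeguards",
)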
Option 3: Contract Necessity (Art. 6(1)(b))
Contract necessity applies when processing is "necessary for the performance of a contract." This basis is almost never valid for AI training, because training a model is not necessary to deliver the contracted service. The user hired you to process their documents, not to train your AI on those documents. Contract necessity fails the necessity test.
Option 4: Statutory Research Exemption (Art. 89)
Article 89 provides safeguards and derogations for processing for "archiving purposes in the public interest, scientific or historical research purposes or statistical purposes." It is not a standalone lawful basis (you still need an Article 6 ground), and the research regime is narrow, designed for academic and public-interest research. It does not generally cover commercial product-improvement training.
Special Categories: When Training Is Strictly Limited
If your SaaS handles special category data (Art. 9) — health data, biometric data, data revealing racial or ethnic origin, political opinions, religious beliefs, trade union membership, genetic data, or sexual life/orientation — the baseline restrictions are much stricter.
Special category data requires both an Article 6 basis and an Article 9 basis for processing. For AI training on special category data, the most relevant Article 9 grounds are:
- Art. 9(2)(a): Explicit consent (higher threshold than standard consent — must be granular and specific to training)
- Art. 9(2)(j): Scientific research with appropriate safeguards (narrow; commercial training rarely qualifies)
There is no legitimate interest ground in Article 9. You cannot use LI to train models on health records, therapy session transcripts, biometric authentication logs, or employment dispute documents.
Practical implication: Medical, HR tech, legal tech, and financial SaaS platforms whose data includes health-related or other special category content face a hard explicit-consent requirement for AI training — or must use fully anonymised data that falls outside GDPR scope entirely.
The Right to Erasure: The Unsolved Technical Problem
GDPR Article 17 grants data subjects the right to erasure ("right to be forgotten"). When a user requests deletion, you must erase their personal data "without undue delay."
For training data retained in a database, this is straightforward: delete the rows. For data already used to train a model, the question is unresolved at law and genuinely technically difficult.
What DPAs Have Said
The Italian DPA (Garante) and the French DPA (CNIL) have both indicated that if training data cannot be erased from a deployed model, the training of that model may have lacked a valid lawful basis from the start, in effect reasoning backwards from the erasure problem to the lawfulness of the original training.
The EDPB Opinion 28/2024 stated that if training data cannot be effectively erased upon request, this "raises serious questions about the compatibility of the training with data protection law" and controllers must be able to demonstrate compliance.
No DPA has issued a binding ruling that model weights must be modified upon erasure requests. But no DPA has issued a ruling that they need not be, either.
The Technical Reality
For large models, "unlearning" a specific data point is an active research area (machine unlearning) but not an operationally deployable capability for most organisations. Practical approaches:
- Model retraining from scratch on a dataset excluding the requester's data — feasible for smaller models, prohibitively expensive for large ones
- Differential privacy during training — formal guarantees (via DP-SGD) that bound what the trained model can reveal about any individual record, supporting the argument that the weights no longer constitute identifiable personal data (Recital 26)
- Influence function estimation — approximate the effect of removing a data point on model parameters; can bound the privacy risk without full retraining
- Strict retention windows — train only on data from the last N months, with rolling deletion; erasure requests are resolved at the next retraining cycle (a sketch of this approach follows below)
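A minimal sketch of the retention-window approach, assuming records are already pseudonymised and carry a timezone-aware created_at timestamp; the helper and field names are illustrative:

# Illustrative rolling-window selection for a retraining run:
# only records newer than the cutoff and not covered by an
# erasure request are exported to the training set.
from datetime import datetime, timedelta, timezone

RETENTION_MONTHS = 6

def select_training_records(records: list[dict], erased_pseudo_ids: set[str]) -> list[dict]:
    cutoff = datetime.now(timezone.utc) - timedelta(days=30 * RETENTION_MONTHS)
    return [
        r for r in records
        if r["created_at"] >= cutoff                  # rolling retention window
        and r["pseudo_id"] not in erased_pseudo_ids   # erasure resolved at next cycle
    ]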
What you must not do: Claim that because model weights are "not personal data," Article 17 does not apply. DPAs examine the training data, not only the output. If personal data was used to train a model and erasure cannot be demonstrated, the training itself may be found unlawful.
Practical Implementation: Building a GDPR-Compliant AI Training Pipeline
Step 1: Identify Your Training Data Categories
Map each data type your training pipeline ingests:
- User-generated content (messages, documents, queries)
- Behavioural data (click patterns, session data, feature usage)
- Account/profile data (name, role, company)
- Derived/inferred data (sentiment scores, topic classifications)
Determine whether any field is or could be special category data under Art. 9.
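One way to make this mapping explicit is a small registry that the export job checks before any field reaches the training set. The sketch below assumes nothing beyond a plain dictionary; the field names, categories, and flags are illustrative:

# Illustrative registry of training data categories and their Art. 9 status.
# The export job refuses any field that is unregistered, or that carries
# special category risk without a documented Art. 9 ground.
DATA_CATEGORIES = {
    "support_message_body": {"category": "user_generated", "special_category_risk": True},  # free text may reveal health, beliefs, ...
    "document_text":        {"category": "user_generated", "special_category_risk": True},
    "feature_click_events": {"category": "behavioural",    "special_category_risk": False},
    "account_name":         {"category": "profile",        "special_category_risk": False},
    "sentiment_score":      {"category": "derived",        "special_category_risk": False},
}

def allowed_for_training(field_name: str, art9_ground_documented: bool = False) -> bool:
    meta = DATA_CATEGORIES.get(field_name)
    if meta is None:
        return False  # unregistered fields never reach the training set
    if meta["special_category_risk"] and not art9_ground_documented:
        return False
    return True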
Step 2: Determine Lawful Basis Per Category
| Data Type | Disclosed at Collection? | Lawful Basis Option | Viable? |
|---|---|---|---|
| Support tickets (anonymised) | No | LI after pseudonymisation | Maybe |
| User documents | No | Explicit consent | Yes |
| Health records | No | Explicit consent (Art. 9(2)(a)) | Only |
| Usage analytics (aggregated) | Maybe | LI or consent | Often |
| User messages | No | Explicit consent | Preferred |
Step 3: Implement Consent Infrastructure
If you choose consent as your basis:
// Training consent captured at a meaningful moment
type ISO8601String = string; // e.g. "2025-01-31T12:00:00Z"

interface TrainingConsentRecord {
  userId: string;
  consentGiven: boolean;
  consentTimestamp: ISO8601String;
  consentVersion: string;  // version of the disclosure shown
  disclosureHash: string;  // hash of the disclosure text
  withdrawnAt?: ISO8601String;
}

// Enforce at data extraction time.
// getConsentRecord, fetchRecords and TrainingRecord stand in for
// your own consent store and data access layer.
async function extractTrainingData(userId: string): Promise<TrainingRecord[]> {
  const consent = await getConsentRecord(userId);
  if (!consent.consentGiven || consent.withdrawnAt) {
    return []; // exclude from the training dataset
  }
  return fetchRecords(userId);
}
Step 4: Build Erasure-Compatible Training Records
Before training, pseudonymise data at the record level and maintain a mapping:
# Pseudonymisation layer before training data export
import hashlib
import hmac

def pseudonymise_record(record: dict, salt: bytes) -> dict:
    """Replace direct identifiers with HMAC pseudonyms."""
    user_id = record.pop('user_id')
    record['pseudo_id'] = hmac.new(
        salt, user_id.encode(), hashlib.sha256
    ).hexdigest()
    # Remove or hash other direct identifiers
    record.pop('email', None)
    record.pop('name', None)
    return record
# On erasure request: delete the pseudonymisation mapping.
# The model never saw raw identifiers — only the pseudonym.
# (compute_pseudo_id and the delete/log/schedule helpers stand in for
#  your own consent store, data store, and retraining scheduler.)
def erase_user(user_id: str, salt: bytes):
    pseudo_id = compute_pseudo_id(user_id, salt)
    # Delete from consent store
    delete_consent_record(user_id)
    # Delete from raw training pool
    delete_raw_records(user_id)
    # Record: pseudo_id X was in training data, now erased
    log_erasure_event(pseudo_id)
    # If model retraining is feasible: schedule exclusion run
    schedule_retraining_exclusion(pseudo_id)
Step 5: Implement Differential Privacy (for High-Volume Training)
If you train frequently on large datasets and cannot feasibly respond to erasure by retraining:
from opacus import PrivacyEngine  # PyTorch DP training

# DP-SGD training bounds the contribution of each individual record.
# model, optimizer, train_loader and EPOCHS are your existing PyTorch objects.
privacy_engine = PrivacyEngine()
model, optimizer, train_loader = privacy_engine.make_private_with_epsilon(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader,
    epochs=EPOCHS,
    target_epsilon=3.0,   # epsilon in the low single digits is commonly treated as strong
    target_delta=1e-5,    # delta < 1/n where n = dataset size
    max_grad_norm=1.0,
)
At epsilon values in the low single digits, differential privacy provides formal guarantees that bound how much any single training example can influence the model, making reconstruction of individual records from the weights implausible. This is currently the strongest technical argument that Art. 17 erasure has been "reasonably" addressed for already-trained models.
Disclosure Best Practices
Whatever lawful basis you use, transparency (GDPR Art. 13/14) requires disclosure of AI training as a purpose. Your privacy notice must include:
- Specific purpose: "We use your submitted content to fine-tune AI models that power [feature name]"
- Legal basis: State which Article 6 ground applies
- Recipients: Name any third-party model training infrastructure (AWS, Azure ML, Google Vertex AI — all US-headquartered providers with CLOUD Act exposure)
- Retention: How long training data is kept before deletion (separate from operational retention)
- Rights: How to exercise Art. 17 erasure for training data specifically
Avoid vague language like "improve our services." DPAs — notably the ICO's 2024 AI Auditing Framework and the CNIL's 2023 AI recommendations — have explicitly criticised vague purpose statements as insufficient for training disclosures.
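For illustration only (the feature name, data types, retention period, region, and contact address are placeholders), a training-specific section of a privacy notice might read:
"AI training. With your consent (Art. 6(1)(a)), we use the [support messages] you submit to fine-tune the model behind [feature name]. Training copies are pseudonymised and deleted within [6 months]. Training runs on [provider, region]. You can withdraw consent at any time in Settings > Privacy > AI Training; withdrawal excludes your data from all future training runs, and you can request erasure of past training data via [privacy contact]."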
The Infrastructure Jurisdiction Question
One frequently overlooked dimension: where does your AI training pipeline run?
If you fine-tune models using AWS SageMaker, Google Vertex AI, or Azure ML, your training data is processed on infrastructure subject to the CLOUD Act (18 U.S.C. § 2713) — US law that allows the US government to compel data disclosure from US-headquartered cloud providers regardless of server location.
This affects your GDPR compliance in two ways:
- Chapter V transfers: Sending training data to a US-controlled training environment is a transfer under GDPR Chapter V (Art. 44 et seq.) and requires an appropriate transfer mechanism (Standard Contractual Clauses, Binding Corporate Rules, or an adequacy decision). The EU-US Data Privacy Framework adequacy decision (July 2023) applies only to certified recipients — your SageMaker instance operates under AWS's DPF certification, but this does not resolve all transfer risks.
- Art. 32 security risk: CLOUD Act compelled disclosure is a foreseeable risk that must be included in your Article 32 risk assessment and technical safeguards documentation.
Training EU user data on EU-native infrastructure — running your own GPU cluster or using EU-headquartered training infrastructure — eliminates this transfer risk entirely and simplifies your Chapter V documentation.
Key Takeaways for SaaS Developers
- Purpose limitation applies to AI training. You need a fresh lawful basis for training if it was not disclosed as a purpose at collection time.
- Consent is the safest basis — but must be freely given, specific, and withdrawable without affecting service access.
- Legitimate interest is not a blanket basis. The necessity and balancing tests are genuine constraints. If anonymised or synthetic data would work, LI for personal data training fails.
- Special category data requires explicit consent under Art. 9(2)(a) — there is no LI ground in Art. 9.
- Art. 17 erasure from model weights is an open legal question. Differential privacy and pseudonymisation are the strongest technical defences. Build your pipeline with erasure-compatible record isolation from the start.
- Infrastructure jurisdiction matters. Training on US cloud infrastructure is a Chapter V transfer. EU-native training infrastructure eliminates this risk.
- Update your privacy notice. Vague "service improvement" language is insufficient. Disclosure must name training as a purpose, state the lawful basis, and explain how users can object or withdraw.
Further Reading
- EDPB Opinion 28/2024 on AI Models (adopted December 2024)
- CNIL Recommendations on AI Development (November 2023)
- ICO AI Auditing Framework (UK GDPR equivalent, June 2024)
- GDPR Articles 5, 6, 7, 9, 13, 14, 17 — particularly Recitals 39, 40, 47, 50
- EU AI Act Articles 10 (data governance for high-risk AI), 53 (GPAI copyright/TDM)
EU-Native Hosting
Ready to move to EU-sovereign infrastructure?
sota.io is a German-hosted PaaS — no CLOUD Act exposure, no US jurisdiction, full GDPR compliance by design. Deploy your first app in minutes.