GDPR Anonymisation for AI Training Data: EU AI Act GPAI Obligations, EDPB Opinion 28/2024, and the August 2026 Deadline
Post #892 in the sota.io EU Cyber Compliance Series
Training AI models on personal data without a valid legal basis is a GDPR violation. For most SaaS developers this statement is uncontroversial in theory and routinely ignored in practice. The EU AI Act's August 2026 GPAI transparency obligations are changing that calculation. Regulators now have a second hook — distinct from GDPR enforcement — to scrutinise the data governance practices behind AI systems deployed in the EU market. The convergence of GDPR and the EU AI Act has created a compliance pinch point that developers building or deploying AI features on European user data need to navigate before the August deadline.
This guide covers the legal framework: what the August 2026 GPAI obligations actually require, how GDPR Article 5(1)(b) purpose limitation constrains the use of existing user data for model training, why the EDPB rejected legitimate interest as a basis for scraping-based training in Opinion 28/2024, when Article 89's research exception applies and when it does not, what genuine anonymisation requires (and why pseudonymisation is not enough), and how the CLOUD Act creates a secondary risk for EU organisations that run their AI training pipelines on US-owned cloud infrastructure.
The August 2026 GPAI Deadline: What Takes Effect
The EU AI Act (Regulation 2024/1689) entered into force on 1 August 2024. Its provisions come into effect on a staggered schedule. The obligations that matter most for developers building or using General Purpose AI (GPAI) models apply from 2 August 2026 — twenty-four months after entry into force.
A GPAI model is defined in Article 3(63) as "an AI model, including where such an AI model is trained with a large amount of data using self-supervision at scale, that displays significant generality and is capable of competently performing a wide range of distinct tasks regardless of the way the model is placed on the market." This definition captures large language models, multimodal foundation models, and large embedding models regardless of whether they are open-weight or proprietary. If you are building a feature on top of a foundation model — even one you fine-tune rather than train from scratch — the model you build on is a GPAI model, and Article 53 obligations apply to the provider of that underlying model from August 2026.
Article 53 obligations for GPAI providers include:
- Drawing up and keeping up to date technical documentation of the model (Art. 53(1)(a))
- Providing information and documentation to downstream deployers (Art. 53(1)(b))
- Complying with EU copyright law including implementing a policy to comply with Article 4(3) of the DSM Directive (the text and data mining exception and its opt-out mechanism) (Art. 53(1)(c))
- Publishing a summary of the content used for training (Art. 53(1)(d))
Article 53(1)(d) is where training data governance intersects directly with GDPR. A "summary of the content used for training" must be published. For any GPAI provider that trained on data containing personal data — including web-scraped data, user-generated content, or licensed datasets with personal data — this publication obligation raises the question of what was collected, under what legal basis, and whether the data was properly anonymised before use.
Providers of GPAI models with systemic risk (broadly, models trained with more than 10^25 FLOPs — currently this covers Llama-3, GPT-4-class models, and Gemini-class models) face additional obligations under Article 55, including adversarial testing, incident reporting to the EU AI Office, and cybersecurity measures. These are relevant to developers choosing which foundation model to deploy, since they affect the compliance posture of the model provider they rely on.
What this means for SaaS developers fine-tuning models: If you fine-tune a GPAI model on your own user data, you are a GPAI provider for the fine-tuned version. You must draw up and keep up to date training data governance documentation for the fine-tuned model, you must have a copyright compliance policy in place, and if the fine-tuned model is made available to other downstream parties (including your own downstream API users), you must be able to provide training data summaries. GDPR compliance for that fine-tuning data is a prerequisite for being able to honestly publish that summary.
GDPR Article 5(1)(b): Purpose Limitation and the AI Training Problem
GDPR Article 5(1)(b) requires that personal data be "collected for specified, explicit and legitimate purposes and not further processed in a manner that is incompatible with those purposes." This is the purpose limitation principle, and it is the central GDPR obstacle to using existing user data for AI model training.
When a SaaS user signs up and generates data — support tickets, activity logs, documents, messages, search queries — they do so in the context of using your service. The purposes for which their data was collected are defined by your privacy notice and the legal bases you invoked: typically contract performance (Art. 6(1)(b)) for the features they signed up for, possibly legitimate interests (Art. 6(1)(f)) for analytics, and consent (Art. 6(1)(a)) for optional processing. None of these legal bases say "to train AI models."
Using that data to train or fine-tune a model is further processing. GDPR Article 6(4) requires a compatibility assessment for further processing that goes beyond the original purpose. The assessment considers: the link between original and new purpose; the context and reasonable expectations of data subjects; the nature of the data; the consequences for data subjects; and the existence of safeguards. For AI training, each of these factors typically cuts against compatibility:
Link between purposes. Using customer support conversations to improve a support bot has an arguable link to the original service purpose. Using the same data to train a general-purpose assistant, to build internal productivity tools, or to develop models sold to third parties has a weak or absent link.
Reasonable expectations. Data subjects who signed up in 2021 under a privacy notice that made no mention of AI training did not reasonably expect their data to be used this way. Post-hoc changes to privacy notices do not retroactively legitimise existing processing.
Nature of the data. Support conversations and user-generated content often contain sensitive categories of data (GDPR Article 9) — health information, financial information, personal relationships — that appear incidentally but are embedded in the corpus. Art. 9 data requires explicit consent or another Art. 9(2) exception for processing; there is no Art. 9(2) exception that covers AI training.
Consequences for data subjects. Training a model on personal data creates the risk of memorisation: the model may reproduce personal data in its outputs. This is not a theoretical risk — it has been demonstrated empirically for both LLMs and diffusion models. The ICO's guidance on AI and data protection explicitly identifies model memorisation as a risk that must be assessed.
Safeguards. The existence of technical safeguards — anonymisation, differential privacy, output filtering — does not transform incompatible processing into compatible processing. Safeguards are relevant to the assessment but are not determinative.
The conclusion for most developers is that using existing user personal data for AI training requires either a new legal basis (typically consent) obtained specifically for that purpose, or genuine anonymisation of the data before it enters the training pipeline — such that it is no longer personal data when trained on.
EDPB Opinion 28/2024: Legitimate Interest Rejected for Scraping-Based Training
Several AI companies argued, in the lead-up to GDPR enforcement proceedings, that GDPR Article 6(1)(f) — legitimate interests — provides a valid legal basis for training models on publicly available web-scraped data. The European Data Protection Board (EDPB) addressed this question squarely in Opinion 28/2024 (adopted in December 2024), which deals with data protection aspects of the development and deployment of AI models.
The EDPB's conclusions in Opinion 28/2024 on the legitimate interest question are unambiguous:
On scraping publicly available data: The EDPB concluded that "the fact that personal data has been made publicly available does not, in itself, constitute a relevant and sufficient legitimate interest for the processing." The public availability of data is not a legal basis. Data subjects who post on public forums, social media, or websites do not thereby consent to having their posts used to train commercial AI systems.
On the balancing test: Even where a legitimate interest exists, GDPR Article 6(1)(f) requires that the processing not override the fundamental rights and freedoms of data subjects. The EDPB found that the scale of personal data involved in web-scale training corpora, the lack of transparency to data subjects, the absence of meaningful consent mechanisms, and the long-term risks of model memorisation together weigh heavily against the balancing test. Scraping-based training on personal data is unlikely to satisfy the Art. 6(1)(f) balancing test in most scenarios.
On the opt-out problem: Several AI developers relied on offering opt-out mechanisms via robots.txt or specific opt-out endpoints. The EDPB found that opt-out mechanisms are not a substitute for a valid initial legal basis. You cannot process personal data without a legal basis and then argue that offering an opt-out retroactively legitimises past collection.
On anonymisation as an exit: The EDPB's Opinion 28/2024 explicitly notes that genuinely anonymised data is not personal data and is not subject to GDPR. Processing that produces genuinely anonymised data before training begins does not require a GDPR legal basis for the training processing itself — because GDPR does not apply to non-personal data. The Opinion therefore creates a strong incentive to invest in genuine anonymisation pipelines rather than relying on contested legal bases.
Opinion 28/2024 does not create new law — it interprets existing GDPR obligations as they apply to AI development. However, it is the most authoritative statement of DPA interpretation of these obligations and will be the reference document for EU DPA enforcement proceedings in 2026 and beyond.
GDPR Article 89: The Research Exception and Its Limits
GDPR Article 89 addresses processing of personal data for scientific research purposes. Article 89(1) requires that such processing be subject to appropriate safeguards, in particular technical and organisational measures that ensure respect for the principle of data minimisation. Article 89(2) permits Union or Member State law to provide derogations from Articles 15, 16, 18, and 21 (access, rectification, restriction, and objection rights) where necessary for scientific research purposes, subject to those safeguards. Recital 159 clarifies that "scientific research" is to be interpreted broadly and covers privately funded as well as publicly funded research.
Some developers and legal advisers have argued that internal AI model development constitutes "scientific research" under Art. 89, enabling the use of personal data for training under the research derogation. The EDPB Opinion 28/2024 addresses this argument and finds it inadequate in commercial AI development contexts:
The commercial purpose problem. Scientific research under GDPR is characterised by a genuine research objective, scientific methodology, and a public benefit that is at least partly independent of the commercial interests of the researcher. Training a product feature model on user data to improve commercial product performance is not scientific research within the meaning of Art. 89, regardless of what the developer labels it internally. The EDPB has consistently held that the research derogation cannot be instrumentalised for commercial processing.
The necessity requirement. Even where the research derogation applies, the Article 89(1) safeguards — in particular data minimisation — continue to apply. If the research objective could be achieved with anonymised data, using identifiable data fails that minimisation requirement, and the research derogation does not justify the processing.
Member State implementation gaps. Article 89 derogations require Member State implementation. Not all Member States have implemented full Art. 89 derogations, and those that have impose conditions that vary across jurisdictions. Relying on Art. 89 for pan-EU processing requires a careful jurisdiction-by-jurisdiction analysis, not a blanket assumption that the derogation applies everywhere.
The practical conclusion for developers: the Art. 89 research exception is narrow, does not cover commercial product development, and even where it applies, requires data minimisation safeguards that typically mean anonymisation should be implemented regardless.
Anonymisation Versus Pseudonymisation: Why the Distinction Is Critical
GDPR Recital 26 defines anonymisation by outcome: "The principles of data protection should therefore not apply to anonymous information, namely information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable." Genuinely anonymised data is outside GDPR's scope entirely.
Pseudonymisation is defined in Article 4(5) as "the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable natural person." Pseudonymised data — data where identifiers have been replaced with tokens or hashes but re-identification is possible with the right key — remains personal data under GDPR. Processing pseudonymised data for AI training still requires a legal basis.
This distinction matters enormously for AI training pipelines:
Hashing identifiers is pseudonymisation, not anonymisation. Replacing user_id with SHA-256(user_id) is pseudonymisation. The hash is deterministic and consistent — if the same user appears in multiple records, the same hash appears. Statistical linkage attacks can reconstruct user identity from behavioural patterns even without the original identifier. Hashed identifiers remain personal data.
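To make the linkage risk concrete, here is a minimal sketch (the user_id field and email addresses are invented) showing that a deterministic hash still links records to one individual and can be reversed by a dictionary attack when the identifier space is guessable:

```python
# Illustrative only: hypothetical identifiers; hashed IDs remain linkable.
import hashlib

def pseudonymise(user_id: str) -> str:
    """Deterministic hash: the same user always maps to the same token."""
    return hashlib.sha256(user_id.encode()).hexdigest()

records = [
    {"user_id": "alice@example.com", "event": "login"},
    {"user_id": "alice@example.com", "event": "support_ticket"},
]
hashed = [{**r, "user_id": pseudonymise(r["user_id"])} for r in records]

# Both records still carry the same token, so behaviour can be linked back
# to one individual -- pseudonymisation, not anonymisation.
assert hashed[0]["user_id"] == hashed[1]["user_id"]

# If the identifier space is guessable (emails, sequential IDs), a simple
# dictionary attack reverses the hash outright.
candidates = ["bob@example.com", "alice@example.com"]
lookup = {pseudonymise(c): c for c in candidates}
print(lookup[hashed[0]["user_id"]])  # -> alice@example.com
```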
Removing names and email addresses is pseudonymisation, not anonymisation. Named entity recognition (NER) to strip person names, email addresses, and phone numbers from text is a useful step, but it does not produce anonymised data. The remaining text may still be identifiable through writing style analysis, unique vocabulary, specific facts described, timestamps, and contextual signals. Text with identifiers removed but contextual personal information retained is pseudonymised at best.
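A short sketch of why identifier stripping alone falls short: regex-based removal of emails and phone numbers (the ticket text below is invented) leaves contextual details that can still single out an individual.

```python
# Illustrative only: direct identifiers are stripped, but contextual personal
# information remains -- the output is pseudonymised at best.
import re

EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def strip_direct_identifiers(text: str) -> str:
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text

ticket = ("Hi, I'm the only left-handed violinist in the local orchestra. "
          "Reach me at maria@example.org or +49 7071 123456 about my invoice.")
print(strip_direct_identifiers(ticket))
# The email and phone number are gone, but the remaining description can
# still single out one person -- generalisation and review are still needed.
```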
Tokenisation and encryption are pseudonymisation. Any reversible transformation is pseudonymisation by definition. Encryption preserves re-identification capability; it cannot produce anonymised data.
Genuine anonymisation for text requires more. The Article 29 Working Party Opinion 05/2014 on Anonymisation Techniques (still authoritative guidance under the EDPB) identifies three criteria a truly anonymous dataset must satisfy: it must not be possible to single out an individual; it must not be possible to link records relating to the same individual; and it must not be possible to infer information about an individual. Achieving all three for text corpora is technically challenging and requires techniques beyond simple identifier removal.
Anonymisation Techniques for AI Training Data
The following techniques, used in combination, represent the current state of practice for producing GDPR-compliant anonymised training corpora:
Differential Privacy (DP). Differential privacy adds mathematically calibrated noise to model training such that the presence or absence of any individual training record cannot be inferred from the trained model's parameters or outputs. DP-SGD (differentially private stochastic gradient descent) provides a formal privacy guarantee expressed as an epsilon-delta bound (ε, δ). Lower epsilon means stronger privacy but higher utility cost (the model is less accurate). Google, Apple, and OpenAI have published research on DP training for language models. DP provides the strongest formal guarantee that individual records cannot be extracted from trained models.
Practical DP constraints: DP requires careful engineering. The privacy accounting must be done correctly — naive implementations apply incorrect bounds. Large epsilon values (ε > 10) provide weak practical privacy despite satisfying the formal definition. The EDPB guidance on anonymisation has not yet formally endorsed specific DP epsilon thresholds, but academic literature suggests ε ≤ 1 for strong privacy and ε ≤ 8 as a practical threshold for many commercial applications.
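As a rough illustration of the mechanics, not a production recipe, here is a DP-SGD loop in plain NumPy on a toy logistic regression: clip each example's gradient to a norm bound, then add Gaussian noise calibrated to that bound. All names and hyperparameters are illustrative; real pipelines should use a vetted library (e.g. Opacus or TensorFlow Privacy) and a proper privacy accountant to derive the (ε, δ) guarantee.

```python
# Minimal DP-SGD sketch: per-example gradient clipping plus calibrated
# Gaussian noise on a toy logistic regression. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 20
X = rng.normal(size=(n, d))
y = (X[:, 0] > 0).astype(float)          # toy labels
w = np.zeros(d)

clip_norm = 1.0         # C: max per-example gradient L2 norm
noise_multiplier = 1.1  # sigma: noise scale relative to C
lr, batch_size, steps = 0.1, 100, 200

def per_example_grads(w, Xb, yb):
    """Logistic-loss gradients, one row per training example."""
    p = 1.0 / (1.0 + np.exp(-Xb @ w))
    return (p - yb)[:, None] * Xb

for _ in range(steps):
    idx = rng.choice(n, size=batch_size, replace=False)
    g = per_example_grads(w, X[idx], y[idx])
    # 1. Clip each example's gradient to norm <= C
    norms = np.linalg.norm(g, axis=1, keepdims=True)
    g = g / np.maximum(1.0, norms / clip_norm)
    # 2. Sum, add Gaussian noise scaled to the clipping bound, then average
    noisy_sum = g.sum(axis=0) + rng.normal(scale=noise_multiplier * clip_norm, size=d)
    w -= lr * noisy_sum / batch_size

# The (epsilon, delta) guarantee follows from (noise_multiplier, batch_size/n,
# steps) via a moments/RDP accountant -- not computed in this sketch.
```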
Synthetic Data Generation. Synthetic data pipelines use the statistical properties of a real dataset to generate a new dataset that is not derived from any individual real record. Techniques include variational autoencoders, generative adversarial networks (GANs), and LLM-based synthetic generation. If the synthetic generation process itself does not memorise individual records, the resulting synthetic dataset is genuinely anonymised — it has no records that correspond to real individuals. Synthetic data is increasingly used in healthcare and financial services for exactly this reason.
Practical synthetic data constraints: The quality of synthetic data depends on the quality of the generation model. Overfitted generators produce synthetic records that closely match real records — this is essentially memorisation, not anonymisation. Membership inference tests (testing whether a real record can be distinguished from synthetic records) should be run as part of the anonymisation validation pipeline.
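A minimal validation sketch along those lines: a nearest-neighbour copy check over numeric feature vectors (the data below is randomly generated for illustration) that flags synthetic records sitting implausibly close to real ones.

```python
# Sketch: flag synthetic records that are closer to a real record than the
# typical real-to-real nearest-neighbour distance -- a sign of memorisation.
import numpy as np

def copy_rate(real: np.ndarray, synthetic: np.ndarray, quantile: float = 0.05) -> float:
    """Fraction of synthetic rows closer to a real row than the baseline
    real-to-real nearest-neighbour distance (self-matches excluded)."""
    def nn_dist(a, b, exclude_self=False):
        d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
        if exclude_self:
            np.fill_diagonal(d, np.inf)
        return d.min(axis=1)

    baseline = np.quantile(nn_dist(real, real, exclude_self=True), quantile)
    return float((nn_dist(synthetic, real) < baseline).mean())

rng = np.random.default_rng(1)
real = rng.normal(size=(500, 8))
good_synth = rng.normal(size=(500, 8))                           # independent draws
bad_synth = real + rng.normal(scale=1e-3, size=(500, 8))         # near-copies

print(copy_rate(real, good_synth))  # low: generator did not memorise
print(copy_rate(real, bad_synth))   # high: near-duplicates of real records
```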
Aggressive Identifier Removal with Contextual Review. Beyond NER-based removal of explicit identifiers, genuinely anonymised text requires: removal of quasi-identifiers (combinations of non-identifying facts that together identify individuals); removal of rare events and outlier descriptions (unique events described in only one record); temporal binning (replacing specific dates with date ranges); geographic generalisation (replacing specific postcodes with regions); and semantic review of remaining text to identify contextual re-identification risks. This is labour-intensive and benefits from automation via purpose-built PII detection models.
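A small sketch of the generalisation steps (field names and thresholds are invented): temporal binning, geographic coarsening, and suppression of rare attribute values.

```python
# Illustrative quasi-identifier generalisation applied before semantic review.
from datetime import date

def generalise(record: dict) -> dict:
    d = date.fromisoformat(record["event_date"])
    return {
        # Replace an exact date with a quarter bucket
        "event_period": f"{d.year}-Q{(d.month - 1) // 3 + 1}",
        # Replace a full postcode with its outward area only
        "region": record["postcode"].split()[0],
        # Suppress rare attribute values below a frequency threshold
        "occupation": record["occupation"] if record["occupation_count"] >= 20 else "other",
    }

print(generalise({
    "event_date": "2025-03-14",
    "postcode": "SW1A 1AA",
    "occupation": "left-handed violinist",
    "occupation_count": 3,
}))
# -> {'event_period': '2025-Q1', 'region': 'SW1A', 'occupation': 'other'}
```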
k-Anonymity and l-Diversity for Structured Data. For structured or tabular data (not text), k-anonymity (ensuring every record is identical on quasi-identifier attributes to at least k-1 other records) and l-diversity (ensuring each equivalence class has at least l distinct sensitive values) are established techniques. These are not applicable directly to text, but are relevant for structured training labels, demographic attributes, and ground truth datasets.
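For completeness, a dependency-free sketch of checking k-anonymity over chosen quasi-identifier columns (the rows below are invented):

```python
# Sketch: smallest equivalence-class size over the quasi-identifier columns.
from collections import Counter

def k_anonymity(rows: list[dict], quasi_identifiers: list[str]) -> int:
    """The dataset is k-anonymous for any k up to the value returned."""
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return min(classes.values())

rows = [
    {"age_band": "30-39", "region": "SW1A", "diagnosis": "A"},
    {"age_band": "30-39", "region": "SW1A", "diagnosis": "B"},
    {"age_band": "40-49", "region": "EC1",  "diagnosis": "A"},
]
print(k_anonymity(rows, ["age_band", "region"]))  # -> 1: the third row is unique
```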
Output Filtering. Post-training output filters that detect and redact personal data from model outputs are not an anonymisation technique — they do not make the training data anonymous — but they are an important safeguard for reducing the harm from model memorisation. Canary insertion testing (embedding unique test strings in training data to measure whether the model memorises them) is a useful audit technique.
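A sketch of what a canary audit can look like in code; generate is a hypothetical stand-in for whatever sampling interface your trained model exposes:

```python
# Sketch: canary insertion audit. `generate` is a hypothetical placeholder
# for your model's sampling function.
import secrets

def make_canaries(n: int = 10) -> list[str]:
    """Unique, high-entropy strings to embed in the training corpus."""
    return [f"CANARY-{secrets.token_hex(8)}" for _ in range(n)]

def canary_leak_rate(canaries: list[str], generate, prompts: list[str]) -> float:
    """Fraction of canaries that appear verbatim in sampled model outputs."""
    outputs = [generate(p) for p in prompts]
    leaked = sum(any(c in out for out in outputs) for c in canaries)
    return leaked / len(canaries)

# Usage: append make_canaries() to the corpus before training; after training,
# call canary_leak_rate(canaries, model_generate_fn, audit_prompts).
# Any non-zero leak rate is direct evidence of memorisation.
```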
Anonymisation Validation. GDPR-compliant anonymisation is not achieved by applying techniques — it is achieved by demonstrating, through testing, that re-identification is not reasonably likely. The EDPB and Article 29 WP guidance emphasises that anonymisation is an outcome, not a process. Validation should include: membership inference attacks (can the model distinguish training members from non-members?); attribute inference attacks (can known attributes predict unknown attributes?); and linkage tests (can records in the anonymised dataset be linked to records in external datasets?).
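As an example of the first test, a loss-threshold membership inference sketch: given per-example losses for training members and held-out non-members (simulated below), the attacker picks the threshold that best separates the two groups. Attack accuracy close to 0.5 suggests the model does not reveal membership; values well above 0.5 indicate leakage.

```python
# Sketch: loss-threshold membership inference on simulated per-example losses.
import numpy as np

def membership_attack_accuracy(member_losses: np.ndarray,
                               nonmember_losses: np.ndarray) -> float:
    losses = np.concatenate([member_losses, nonmember_losses])
    labels = np.concatenate([np.ones_like(member_losses),
                             np.zeros_like(nonmember_losses)])
    best = 0.5
    for t in np.unique(losses):
        pred = (losses <= t).astype(float)   # low loss -> predicted member
        best = max(best, (pred == labels).mean())
    return float(best)

rng = np.random.default_rng(2)
members = rng.normal(0.8, 0.3, size=2000)     # members fit better (lower loss)
nonmembers = rng.normal(1.2, 0.3, size=2000)
print(membership_attack_accuracy(members, nonmembers))  # well above 0.5: leakage
```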
The CLOUD Act Problem for AI Training Pipelines
EU organisations that build GDPR-compliant anonymised training datasets and then run their training pipelines on US-owned cloud infrastructure face a secondary risk that is less well understood: the CLOUD Act applies to the training infrastructure, not just the data stored at rest.
Training compute on US-owned platforms. Amazon SageMaker (Amazon Web Services Inc., Delaware), Google Vertex AI (Google LLC, Delaware), Azure Machine Learning (Microsoft Corporation, Washington State) — all are operated by US-incorporated entities subject to the CLOUD Act. When you run a training job on these platforms, the data in your training corpus passes through infrastructure that the CLOUD Act obligates the platform operator to disclose to US law enforcement on valid demand.
The anonymisation intersection. If your training data is genuinely anonymised, the CLOUD Act exposure is limited — genuinely anonymised data that cannot be traced to individuals carries lower regulatory risk even if disclosed. However, several problems remain:
First, the anonymisation may not be provably complete. If a US law enforcement demand arrives during or after training, and the platform operator discloses training data, there may be disputes about whether the anonymisation was effective. DPA enforcement can follow even where anonymisation was attempted but imperfect.
Second, model weights are potentially sensitive. A model trained on EU personal data — even imperfectly anonymised — encodes statistical properties of that data in its weights. Model weights themselves are not personal data in the traditional sense, but model weights trained on personal data are the product of personal data processing. If those weights are stored on US-owned infrastructure, a CLOUD Act demand for the weights would be legally novel but not impossible.
Third, the training pipeline includes personal data at intermediate stages. The anonymisation process itself is a stage of processing. If anonymisation happens on US-owned compute infrastructure, personal data exists on that infrastructure — even transiently — before anonymisation is complete. This is a transfer to the US under GDPR Article 44, requiring SCCs or another transfer mechanism, and it is subject to CLOUD Act access by the platform operator before anonymisation is applied.
The CLOUD Act-clean approach. Running AI training pipelines on EU-sovereign infrastructure operated by EU-incorporated entities eliminates CLOUD Act exposure entirely. The platform operator has no US law obligations; there is no US entity that can receive a CLOUD Act demand; SCCs are not needed for the training compute transfer because there is no transfer to a third country.
EU-sovereign compute options for AI training include Scaleway (Scaleway SAS, France, COMPUTE GPU H100 instances), OVHcloud (OVH SAS, France, AI Training), Hetzner (Hetzner Online GmbH, Germany, dedicated GPU servers), and Deutsche Telekom T-Systems (DE), among others. Deploying training pipelines on EU-sovereign infrastructure, combined with genuine anonymisation of training data, provides a defensible compliance posture under both GDPR and the EU AI Act.
Developer Checklist: GDPR-Compliant AI Training Data Pipeline
The following checklist summarises the obligations described above:
Step 1 — Legal Basis Audit
- Document the original legal basis under which each dataset was collected
- Assess compatibility between original purpose and training purpose (Art. 6(4))
- If incompatible: obtain specific consent for AI training OR proceed only with genuinely anonymised data
- Check for Art. 9 special-category data in the training corpus — collecting and anonymising it is itself processing of personal data, so an Art. 9(2) exception is needed even where the eventual training uses only anonymised output
- Document your legal basis analysis in your AI Act Art. 53 technical documentation
Step 2 — Data Minimisation
- Identify the minimum dataset size and minimum data granularity required for your training objective
- Apply temporal limits: only use data from periods relevant to the training objective
- Remove data subjects who exercised erasure rights before data enters the training pipeline
- Remove data collected from users who opted out of AI training (a filtering sketch follows this list)
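A minimal sketch of the removal steps above (field names and dates are hypothetical): filter out erased and opted-out data subjects and out-of-window records before anything enters the pipeline.

```python
# Sketch: drop erased and opted-out data subjects, and out-of-window records,
# before any record enters the training pipeline. Hypothetical field names.
from datetime import date

def filter_training_records(records, erased_ids, opted_out_ids, cutoff):
    excluded = set(erased_ids) | set(opted_out_ids)
    return [r for r in records
            if r["user_id"] not in excluded and r["created_at"] >= cutoff]

records = [
    {"user_id": "u1", "created_at": date(2025, 2, 1), "text": "..."},
    {"user_id": "u2", "created_at": date(2025, 3, 5), "text": "..."},
    {"user_id": "u3", "created_at": date(2022, 1, 9), "text": "..."},
]
corpus = filter_training_records(
    records,
    erased_ids={"u2"},        # Art. 17 erasure requests received to date
    opted_out_ids=set(),      # users who opted out of AI training
    cutoff=date(2024, 1, 1),  # temporal limit from the data-minimisation step
)
print([r["user_id"] for r in corpus])  # -> ['u1']
```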
Step 3 — Anonymisation Pipeline
- Apply NER-based identifier removal (names, emails, phone numbers, addresses)
- Remove quasi-identifiers and rare events
- Apply geographic and temporal generalisation
- Run membership inference tests on anonymised output
- Document anonymisation methodology and validation results
- Consider DP-SGD for fine-tuning if dataset contains difficult-to-anonymise content
Step 4 — Infrastructure
- Confirm training compute is on EU-sovereign infrastructure or assess CLOUD Act risk for US-owned compute
- If using US-owned compute: anonymise data before transfer; implement SCCs; document transfer impact assessment
- Store model weights in EU-sovereign storage if models encode sensitive domain knowledge
- Document infrastructure provenance in AI Act technical documentation
Step 5 — Rights Management
- Implement an erasure pipeline that removes data subjects from future training datasets when erasure requests are received
- Document that past fine-tuning runs cannot retroactively apply erasure (explain in privacy notice)
- Consider "machine unlearning" techniques for production models where erasure of memorised data is required
- Publish training data summary (Art. 53(1)(d)) that describes anonymisation measures without disclosing personal data
Step 6 — Ongoing Compliance
- Maintain training data documentation for at least the lifetime of the model plus the applicable GDPR documentation retention period
- Run periodic re-anonymisation audits as the training corpus grows
- Monitor EDPB guidance updates on AI and anonymisation — this area is actively evolving
- Monitor EU AI Office publications on GPAI code of practice compliance (Art. 56)
EU-Sovereign AI Training Infrastructure
For teams building GDPR-compliant AI training pipelines entirely on EU infrastructure, the relevant options span multiple layers of the training stack:
Training Compute. GPU instances for model fine-tuning and training are available from: Scaleway (H100/A100 GPUs, Paris/Amsterdam data centres, Scaleway SAS, France); OVHcloud (AI Training with A100 clusters, OVH SAS, France); Lambda Labs' EU data centre (caution: Lambda Labs Inc. is US-incorporated — verify entity); Hetzner GPU servers (dedicated, Germany); Deutsche Telekom T-Systems Open Telekom Cloud (Germany). For larger training runs, European HPC resources such as LUMI (CSC, Finland) and JUWELS (FZ Jülich, Germany) are available to research institutions.
Data Processing and Storage. S3-compatible EU-sovereign object storage: Scaleway Object Storage, OVHcloud Object Storage (S3), Cloudferro (Wrocław, Poland — EU and EEA data centres), Infomaniak (Geneva, Switzerland). Note that Switzerland is neither an EU nor an EEA member; transfers to Switzerland rely on the European Commission's adequacy decision, and Swiss data protection law (the revised FADP) applies rather than the GDPR directly.
MLOps and Experiment Tracking. Self-hosted MLflow on EU infrastructure; self-hosted DVC on EU infrastructure; Weights & Biases has EU data residency options but operates as Weights & Biases Inc. (US) — verify current DPA and GDPR status before use.
Model Serving. EU-sovereign model serving via sota.io (managed PaaS, EU jurisdiction), self-hosted on Hetzner or Scaleway VM, or using EU-native Kubernetes via managed services (IONOS Managed Kubernetes, OVHcloud Managed Kubernetes).
Conclusion
The August 2026 EU AI Act GPAI deadline is the visible horizon, but the underlying GDPR obligations on AI training data exist now. EDPB Opinion 28/2024 has clarified that legitimate interest does not cover scraping-based training, that pseudonymisation is not anonymisation, and that genuinely anonymous data is the cleanest path to a compliant training pipeline. The Article 89 research exception does not cover commercial product development. For SaaS developers fine-tuning models on user data, the compliance path requires either informed consent specific to AI training or a proper anonymisation pipeline with validated output.
The CLOUD Act adds a second dimension: EU organisations that route training pipelines through US-owned infrastructure — even with anonymised data — face CLOUD Act access risk during the processing window before anonymisation is complete, and potentially for model weights that encode personal data patterns. Running the entire pipeline on EU-sovereign infrastructure eliminates this exposure.
sota.io provides managed EU-native deployment infrastructure for AI-enabled applications, running exclusively on EU-incorporated cloud providers without US-parent exposure. Applications trained, validated, and deployed on EU-sovereign infrastructure can document both GDPR compliance and CLOUD Act-clean provenance in a single infrastructure decision.
See also:
- GDPR Article 7 Consent Conditions: Complete SaaS Developer Checklist
- EU AI Act Article 10: Training Data Governance for High-Risk AI Systems
- GDPR Article 82: Data Breach Liability and Compensation
- EU AI Act Article 5 Prohibited Practices: What Developers Must Stop Doing in 2026
EU-Native Hosting
Ready to move to EU-sovereign infrastructure?
sota.io is a German-hosted PaaS — no CLOUD Act exposure, no US jurisdiction, full GDPR compliance by design. Deploy your first app in minutes.