2026-05-05 · 12 min read

GDPR Pseudonymisation vs Anonymisation: What Actually Counts as Personal Data for SaaS Developers — Developer Guide 2026

One of the most common compliance mistakes in SaaS development is treating pseudonymous data as anonymous. Engineers hash email addresses, truncate IP addresses, or replace user IDs with opaque tokens — and then conclude the resulting dataset falls outside GDPR. It does not. GDPR Recital 26 establishes a specific test for what counts as "truly anonymous," and most data that has been through a reversible or brute-forceable transformation fails it.

This guide explains the legal line between pseudonymisation and anonymisation, why the most popular "anonymisation" techniques do not actually work, what proper anonymisation requires in practice, and where pseudonymisation — even though it keeps data in GDPR scope — still earns you meaningful compliance benefits.


The Regulatory Framework: Recital 26 and the Identifiability Test

GDPR Article 2(1) defines the regulation's material scope as applying to "processing of personal data." Article 4(1) defines personal data as "any information relating to an identified or identifiable natural person." The key operative word is identifiable — not merely identified.

Recital 26 provides the test: data is not personal data only where the natural person is "not identifiable" — specifically, where identification is impossible taking into account "all the means reasonably likely to be used, such as singling out, either by the controller or by any other person." The assessment must consider "all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments."

Two immediate consequences for developers:

  1. The test is dynamic — data that is genuinely anonymous today might become personal data as computing costs drop or new datasets become publicly available.
  2. The test is contextual — you must consider not just your own re-identification capability but that of any party who might obtain the data.

This is why "we deleted the email address" is not a sufficient anonymisation argument. If there is any other field in the dataset, or any external dataset, that could be combined with yours to re-identify an individual, the Recital 26 test fails.


GDPR Definitions: Pseudonymisation vs Anonymisation

Pseudonymisation (Art.4(5))

GDPR defines pseudonymisation as:

"the processing of personal data in a manner such that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable natural person."

Three elements make data pseudonymous rather than anonymous:

  1. Attribution to a specific data subject remains possible, but only with additional information (the key or mapping table).
  2. That additional information is kept separately from the pseudonymised dataset.
  3. Technical and organisational measures ensure the additional information cannot be recombined with the data by unauthorised parties.

Pseudonymous data is still personal data and is fully in scope of GDPR. The controller cannot lawfully process it as if it were anonymous.

Anonymisation (Recital 26)

There is no formal definition of anonymisation in the GDPR text — only the Recital 26 test. The practical standard comes from WP29 Opinion 05/2014 on Anonymisation Techniques (since endorsed by the EDPB), which requires that anonymisation make it:

  1. Impossible to single out an individual.
  2. Impossible to link records relating to the same individual.
  3. Impossible to infer any information about an individual.

All three criteria must be met simultaneously and must remain met under "reasonably likely" re-identification attacks using technology available now and in the near future. Meeting two out of three is not anonymisation.


Why Common "Anonymisation" Techniques Fail

MD5 or SHA-256 Email Hashing

The most frequent mistake. A developer hashes user@example.com with MD5 or SHA-256, stores the resulting digest, and considers the dataset anonymous.

Why it fails the Recital 26 test: the input space is small and enumerable. Email addresses are low-entropy values, so an attacker with a candidate list (a leaked database, a marketing list, or simple enumeration of common address patterns) can hash every candidate and compare digests. Deterministic hashing also preserves linkability: the same email always yields the same digest, so records can be joined across datasets.

WP29 Opinion 05/2014 explicitly classified deterministic hashing of low-entropy values (email addresses, phone numbers, national IDs) as pseudonymisation, not anonymisation.

The correct treatment: hashed email addresses are pseudonymous identifiers. They attract full GDPR obligations.
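To make the risk concrete, here is a minimal dictionary-attack sketch (the function name and candidate list are hypothetical, for illustration only): hash every candidate address and compare against the observed digests.

import hashlib

def dictionary_attack(observed_digests: set, candidate_emails: list) -> dict:
    # Deterministic hashing of an enumerable input space: hash every
    # candidate and check for a match. No key is needed to reverse it.
    recovered = {}
    for email in candidate_emails:
        digest = hashlib.sha256(email.lower().encode()).hexdigest()
        if digest in observed_digests:
            recovered[digest] = email
    return recovered

Commodity GPUs test billions of hashes per second against MD5 or SHA-256, so even very large candidate lists are exhausted quickly.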

Truncated IP Addresses

Removing the last octet of an IPv4 address (e.g., 192.168.1.xxx) is a common analytics practice. It fails because truncation leaves at most 256 candidate addresses per record, and the surrounding log fields (timestamps, user agents, requested URLs) usually narrow that set to a single plausible individual.

The CJEU in Breyer v. Germany (C-582/14) established that dynamic IP addresses are personal data where the controller has legal means to obtain identifying information from the ISP, a position also taken by the German data protection authorities. A truncated IP address in the same dataset is better characterised as pseudonymous.

Replacing User IDs with Random Tokens

Substituting user_id = 7341 with user_id = 8f3a2b1c looks like anonymisation but is simply pseudonymisation with a lookup table. The token is a direct pseudonym — re-identification requires only the mapping table.

This is not a problem to avoid; it is the definition of pseudonymisation. The issue is treating it as anonymisation.
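A minimal sketch of what that looks like (a hypothetical class, for illustration): the mapping table is precisely the Art.4(5) "additional information", and anyone holding it can re-identify every token.

import secrets

class TokenVault:
    # Substitution pseudonymisation: the mapping table is the "additional
    # information" under Art.4(5) and must be stored separately and protected.
    def __init__(self):
        self._token_by_user = {}
        self._user_by_token = {}

    def tokenise(self, user_id: int) -> str:
        if user_id not in self._token_by_user:
            token = secrets.token_hex(8)
            self._token_by_user[user_id] = token
            self._user_by_token[token] = user_id
        return self._token_by_user[user_id]

    def reidentify(self, token: str) -> int:
        # Trivial for anyone with table access, hence pseudonymous, not anonymous
        return self._user_by_token[token]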

Removing Obvious Direct Identifiers

A dataset that removes name, email, national ID, and phone number but retains age, zip code, gender, and occupation fails the singling-out test. The landmark study by Latanya Sweeney (2000) demonstrated that 87% of the US population can be uniquely identified using only three fields: date of birth, gender, and five-digit ZIP code. For European data, similar quasi-identifier combinations (birth year + municipality + occupation) provide high re-identification rates.

WP29 called this the "quasi-identifier" problem. Removing obvious identifiers while retaining enough quasi-identifiers fails Recital 26.
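The effect is easy to measure on your own exports. A sketch (field names are assumptions for illustration) that computes the fraction of records whose quasi-identifier combination is unique, i.e. directly singled out:

from collections import Counter

def singling_out_rate(records: list, quasi_identifiers: list) -> float:
    # Fraction of records whose quasi-identifier combination appears exactly once
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(keys)

# Example: check a "de-identified" export
records = [
    {"age": 34, "zip": "10115", "gender": "f", "occupation": "engineer"},
    {"age": 34, "zip": "10115", "gender": "f", "occupation": "designer"},
    {"age": 51, "zip": "80331", "gender": "m", "occupation": "teacher"},
]
print(singling_out_rate(records, ["age", "zip", "gender"]))  # 0.333...: one of three records is singled out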


What Proper Anonymisation Actually Requires

The WP29/EDPB Technique Taxonomy

WP29 Opinion 05/2014 evaluates five main anonymisation approaches against the singling-out / linkability / inference triad:

| Technique | Prevents singling-out | Prevents linkability | Prevents inference | Assessment |
|---|---|---|---|---|
| Noise addition | Partially | Partially | Partially | Alone: insufficient |
| Substitution (pseudonymisation) | No | No | No | Not anonymisation |
| Aggregation / generalisation | Partially | Partially | Partially | Depends on k |
| Data suppression | Partially | No | Partially | Context-dependent |
| Data swapping | Partially | No | Partially | Risky |

No single technique reliably satisfies all three tests simultaneously. Proper anonymisation typically combines multiple techniques.

K-Anonymity

K-anonymity requires that each record in a published dataset is indistinguishable from at least k-1 other records with respect to their quasi-identifier fields. If k = 5, no individual is uniquely singled out — they share characteristics with at least 4 others.
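In code, the k-anonymity check reduces to the smallest equivalence-class size after generalising quasi-identifiers. A sketch under assumed field names (age banded into 10-year groups, ZIP truncated to a 3-digit prefix):

from collections import Counter

def generalise(record: dict) -> tuple:
    # Coarsen quasi-identifiers: 10-year age bands, 3-digit ZIP prefix
    return (record["age"] // 10 * 10, record["zip"][:3], record["gender"])

def satisfies_k_anonymity(records: list, k: int = 5) -> bool:
    # Every equivalence class must contain at least k records
    classes = Counter(generalise(r) for r in records)
    return min(classes.values()) >= k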

EDPB practical guidance (derived from WP29 and national DPA positions) stresses one key limitation: k-anonymity alone does not prevent inference attacks. If all k records sharing the same quasi-identifier also share the same sensitive attribute value (e.g., all 5 users in a zip-age cell have the same medical condition), an attacker can infer the sensitive attribute even without identifying the specific individual.

L-Diversity

L-diversity extends k-anonymity by requiring that each equivalence class contains at least l distinct values for each sensitive attribute. This prevents attribute inference. A dataset satisfies (k=5, l=3)-anonymity if every group of 5 quasi-identifier-identical records contains at least 3 distinct values for sensitive attributes.
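A sketch of the corresponding check (field names again assumed): group records by quasi-identifier class and count distinct sensitive values in each group.

from collections import defaultdict

def satisfies_l_diversity(records: list, quasi_identifiers: list,
                          sensitive_attr: str, l_min: int = 3) -> bool:
    # Each equivalence class needs at least l_min distinct sensitive values
    classes = defaultdict(set)
    for r in records:
        key = tuple(r[q] for q in quasi_identifiers)
        classes[key].add(r[sensitive_attr])
    return all(len(values) >= l_min for values in classes.values())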

Differential Privacy

Differential privacy provides a mathematically rigorous anonymisation guarantee. A mechanism M is ε-differentially private if for any two datasets D and D' differing in exactly one record:

Pr[M(D) ∈ S] ≤ e^ε × Pr[M(D') ∈ S]

The parameter ε (epsilon) is the privacy budget. Smaller ε means stronger privacy but less accuracy. With ε = 1, for example, adding or removing any single individual's record changes the probability of any given output by at most a factor of e ≈ 2.72.

Practical thresholds depend on context, but commonly cited guidance treats ε at or below 1 as a strong guarantee, single-digit values as acceptable for many deployments, and large budgets (ε well above 10) as weak. Differential privacy is used in production by Google (RAPPOR), Apple (telemetry collection), and the US Census Bureau (2020 Census). For SaaS aggregate analytics, the Laplace mechanism or Gaussian mechanism over counts and sums provides the best accuracy-privacy trade-off.

Aggregation with Minimum Count Thresholds

The simplest workable approach for many SaaS analytics use cases: publish only aggregate counts where every cell value represents at least n distinct individuals, suppress cells below the threshold, and add rounding noise.

Common thresholds come from national guidance: the German conference of data protection authorities (DSK) has published guidance for health research requiring minimum cell sizes of 3-5 with rounding to the nearest 5, and UK ICO guidance on statistical disclosure control suggests similar thresholds for aggregate data publication.


Pseudonymisation Benefits Under GDPR

Even though pseudonymised data remains personal data and subject to full GDPR obligations, pseudonymisation is explicitly rewarded across multiple articles.

Art.25 — Privacy by Design and by Default

Article 25(1) requires controllers to implement "appropriate technical and organisational measures... such as pseudonymisation" both at the time of design and by default. Implementing robust pseudonymisation (separate key storage, per-tenant key rotation, encryption at rest) satisfies an explicit Privacy by Design obligation and provides evidence of compliance in DPA audits.

Art.32 — Security of Processing

Article 32(1)(a) lists "pseudonymisation and encryption of personal data" as an example of an appropriate technical measure for security. In breach notification assessments under Art.33 and Art.34, whether the breached data was pseudonymised significantly affects whether notification to data subjects is required.

EDPB Guidelines 9/2022 on personal data breach notification state that if the attacker obtained only pseudonymous tokens without the key, the "high risk to the rights and freedoms of natural persons" threshold for Art.34 notification to individuals may not be met — even if Art.33 notification to the supervisory authority is still required.

This can be the difference between a contained incident and a mass notification exercise.

Art.89 — Research, Statistics, Archiving

Article 89(1) requires that processing for research, statistics, or archiving purposes includes "safeguards... such as pseudonymisation." Member states may provide derogations from data subject rights (Art.15-22) for such processing — but those derogations typically require pseudonymisation as a condition.

Truly anonymised data used for research falls entirely outside GDPR (Recital 26). Pseudonymised data can access the Art.89 derogations pathway with appropriate safeguards, making longitudinal cohort analysis on SaaS usage data far easier than it would be with fully identified data.

Art.6(4)(e) — Compatible Purpose Test

When assessing whether further processing is compatible with the original purpose under Art.6(4), one of the five factors is "the existence of appropriate safeguards, which may include encryption or pseudonymisation." Pseudonymisation makes it easier to justify secondary analytics on data collected for service delivery — a significant practical benefit for product teams running usage analysis.

Art.28 — Processor Agreements

Pseudonymisation in transit and at rest reduces the blast radius of processor incidents. A sub-processor who processes only pseudonymous tokens and has no access to the key mapping cannot re-identify data subjects — limiting both your Art.28 liability exposure and the sub-processor's own GDPR obligations.


Practical Implementation for SaaS Architectures

Tenant Analytics

The most common use case: you want to analyse cross-tenant usage patterns without exposing individual user behaviour.

Wrong approach: Hash user_id, publish the hash alongside feature usage events.

Correct approach (pseudonymised, not anonymous):

import hashlib
import hmac

class TenantPseudonymiser:
    def __init__(self, daily_key: bytes):
        # Key rotated daily: limits linkability across days
        self.key = daily_key

    def pseudonymise(self, user_id: int, tenant_id: str) -> str:
        # Keyed HMAC: re-identification requires the key (recompute and compare);
        # truncating to 16 hex chars keeps tokens short without practical collisions
        payload = f"{tenant_id}:{user_id}".encode()
        return hmac.new(self.key, payload, hashlib.sha256).hexdigest()[:16]

This is pseudonymisation with key rotation — it limits temporal linkability while keeping records linkable within a day. It does NOT produce anonymous data. Store the daily key in a separate secret store (e.g., AWS Secrets Manager, HashiCorp Vault) with restricted access.
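One way to produce the rotating daily_key without storing a new secret per day is to derive it from a single master secret held in the vault. The derivation function below is a hypothetical sketch, not a prescribed scheme:

import hashlib
import hmac
from datetime import date

def derive_daily_key(master_key: bytes, day: date) -> bytes:
    # HMAC-based derivation: one master secret, one key per day.
    # Old daily keys can be re-derived for audits but need not be stored.
    return hmac.new(master_key, day.isoformat().encode(), hashlib.sha256).digest()

# Usage with the pseudonymiser above:
# pseudonymiser = TenantPseudonymiser(derive_daily_key(master_key, date.today()))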

Aggregate Analytics (Anonymisation Attempt)

To produce genuinely anonymous aggregate statistics:

from dataclasses import dataclass
from typing import Dict, List
import math
import secrets

@dataclass
class AggregateCell:
    dimension_values: Dict[str, str]
    count: int
    is_suppressed: bool
    noise_applied: float

class AnonymisationChecker:
    def __init__(
        self,
        k_min: int = 5,
        sensitivity_level: str = "standard",
        epsilon: float = 1.0
    ):
        self.k_min = k_min
        self.sensitivity_level = sensitivity_level
        self.epsilon = epsilon  # differential privacy budget

        # Override k_min for sensitive contexts
        if sensitivity_level == "health":
            self.k_min = max(k_min, 20)
        elif sensitivity_level == "employment":
            self.k_min = max(k_min, 10)

    def apply_laplace_noise(self, true_count: int, sensitivity: float = 1.0) -> float:
        # Laplace mechanism via inverse CDF: X = mu - b * sgn(u) * ln(1 - 2|u|),
        # where u ~ Uniform(-0.5, 0.5) and scale b = sensitivity / epsilon
        scale = sensitivity / self.epsilon
        u = secrets.SystemRandom().uniform(0, 1) - 0.5  # in [-0.5, 0.5)
        return true_count - scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

    def process_cell(self, cell: AggregateCell) -> AggregateCell:
        if cell.count < self.k_min:
            return AggregateCell(
                dimension_values=cell.dimension_values,
                count=0,
                is_suppressed=True,
                noise_applied=0.0
            )
        noisy_count = self.apply_laplace_noise(cell.count)
        # Round to nearest 5 for additional disclosure control
        rounded = round(max(noisy_count, self.k_min) / 5) * 5
        return AggregateCell(
            dimension_values=cell.dimension_values,
            count=int(rounded),
            is_suppressed=False,
            noise_applied=noisy_count - cell.count
        )

    def check_l_diversity(
        self,
        cells: List[AggregateCell],
        sensitive_attribute_key: str,
        l_min: int = 3
    ) -> List[str]:
        warnings = []
        for cell in cells:
            if not cell.is_suppressed:
                attr_val = cell.dimension_values.get(sensitive_attribute_key)
                if attr_val is not None:
                    # In a real implementation, check that the equivalence class
                    # containing this cell has l_min distinct attribute values
                    warnings.append(
                        f"Cell {cell.dimension_values}: verify l-diversity "
                        f"(l≥{l_min}) for sensitive attribute '{sensitive_attribute_key}'"
                    )
        return warnings

    def assess_anonymisation(self, raw_count: int) -> dict:
        return {
            "is_anonymous": raw_count >= self.k_min,
            "k_threshold": self.k_min,
            "action": "publish_with_noise" if raw_count >= self.k_min else "suppress",
            "epsilon": self.epsilon,
            "sensitivity_level": self.sensitivity_level,
            "regulatory_basis": "WP29 Opinion 05/2014 + EDPB Guidelines"
        }
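Putting the checker to work on a single cell (the dimension values are illustrative):

checker = AnonymisationChecker(k_min=5, sensitivity_level="health", epsilon=1.0)

# k_min is raised to 20 automatically for the "health" context
cell = AggregateCell(
    dimension_values={"country": "DE", "plan": "pro"},
    count=37, is_suppressed=False, noise_applied=0.0
)
print(checker.process_cell(cell))       # noisy, rounded count (>= 20), or suppressed
print(checker.assess_anonymisation(3))  # below k_min -> action: "suppress"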

IP Address Handling

Rather than truncating IP addresses (which provides limited anonymisation), consider:

  1. Deriving only what you actually need at ingestion (e.g., country-level geolocation) and discarding the raw address.
  2. Replacing the address with a keyed, rotating HMAC where you need distinct-visitor counts (pseudonymous and still in GDPR scope, but with far lower re-identification risk; see the sketch below).
  3. Not collecting the address at all in pipelines where it serves no purpose.

Never: store raw IP addresses in analytics pipelines and call them "anonymised" after truncation.
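A sketch of option 2 (hypothetical helper, using the same rotating-key pattern as the tenant pseudonymiser above): a keyed HMAC over the packed address supports distinct-visitor counting without retaining the IP.

import hashlib
import hmac
import ipaddress

def pseudonymise_ip(ip: str, rotating_key: bytes) -> str:
    # Works for IPv4 and IPv6; the raw address is discarded after hashing.
    # Still pseudonymous under Art.4(5): the key holder can match candidates.
    packed = ipaddress.ip_address(ip).packed
    return hmac.new(rotating_key, packed, hashlib.sha256).hexdigest()[:16]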


Compliance Checklist — 25 Items

Data Classification (5 items)

  1. Inventory every dataset and classify each field as identified, pseudonymous, or anonymous.
  2. Treat hashed emails, tokenised user IDs, and truncated IP addresses as pseudonymous personal data, not anonymous.
  3. Identify quasi-identifiers (age, zip code, gender, occupation) that could enable singling out.
  4. Document a Recital 26 assessment for any dataset claimed to be anonymous.
  5. Re-run that assessment periodically as technology and publicly available datasets evolve.

Pseudonymisation Implementation (7 items)

  1. Use a keyed HMAC (e.g., HMAC-SHA256), never a plain hash, to derive pseudonyms.
  2. Store the key or mapping table separately from the pseudonymised data (Art.4(5)).
  3. Restrict key access with technical and organisational measures.
  4. Rotate keys (e.g., daily) where long-term linkability is not required.
  5. Keep keys in a dedicated secret store (e.g., AWS Secrets Manager, HashiCorp Vault).
  6. Ensure sub-processors receive only tokens, never the key mapping (Art.28).
  7. Record pseudonymisation as a Privacy by Design measure (Art.25) for audit evidence.

Anonymisation Validation (6 items)

  1. Verify all three WP29 criteria: no singling out, no linkability, no inference.
  2. Enforce a minimum cell size (k ≥ 5, higher for sensitive contexts such as health data).
  3. Check l-diversity (l ≥ 3) for sensitive attributes.
  4. Apply noise (e.g., the Laplace mechanism) and rounding to published aggregates.
  5. Suppress cells below the threshold rather than publishing them.
  6. Test the output against reasonably likely linkage attacks using external datasets.

Breach Response (4 items)

  1. Record whether breached data was pseudonymised and whether the key was exposed.
  2. Assess the Art.34 "high risk" threshold in light of EDPB Guidelines 9/2022.
  3. Notify the supervisory authority under Art.33 even where data subjects need not be notified.
  4. Document the reasoning behind any decision not to notify data subjects.

Research and Analytics (3 items)

  1. Apply pseudonymisation as the Art.89(1) safeguard for research and statistics processing.
  2. Verify member-state derogation conditions before relying on Art.15-22 derogations.
  3. Prefer validated anonymous aggregates for any external publication or data sharing.


Infrastructure Jurisdiction and Pseudonymisation

For SaaS hosted on EU-sovereign infrastructure, pseudonymisation has an additional practical benefit: the key mapping and the pseudonymised data can be held entirely within the EU, and access from outside the EU can be restricted at the infrastructure level.

For SaaS using US-parent cloud providers, pseudonymisation does not resolve CLOUD Act exposure. The CLOUD Act allows US authorities to compel disclosure of data held by US persons or entities regardless of where the data is stored. A pseudonymous dataset held on AWS or Azure remains subject to potential CLOUD Act disclosure, and if the key mapping is also held there, the effective protection is limited. Following the CJEU's Schrems II judgment (Case C-311/18), the EDPB's supplementary-measures guidance treats pseudonymisation as an effective safeguard only where the key remains inaccessible to the importing US entity.

EU-native infrastructure where neither the pseudonymised data nor the key mapping is ever processed by a US entity provides the strongest position under GDPR and removes the CLOUD Act overlap entirely.


Summary

| | Identified Data | Pseudonymous Data | Anonymous Data |
|---|---|---|---|
| GDPR applies? | Yes | Yes | No (Recital 26) |
| Data subject rights? | Full (Art.15-22) | Full | None |
| Breach notification? | Art.33 + Art.34 | Art.33 + Art.34 assessment | No obligation |
| Art.89 research derogation? | No (usually) | Yes (with safeguards) | N/A |
| Art.25 Privacy by Design credit? | No | Yes | N/A |
| Art.32 security measure credit? | No | Yes (reduced breach severity) | N/A |
| Typical techniques | None (raw identifiers) | HMAC-SHA256, encryption, tokenisation | k-anonymity + DP + suppression |

The practical takeaway: pseudonymisation is not a path out of GDPR, but it is a well-recognised tool that reduces breach severity, enables research derogations, and satisfies Privacy by Design obligations. True anonymisation is genuinely difficult, requires combining multiple techniques, and must withstand the Recital 26 re-identification test — including future technology developments.

Most SaaS developers should default to implementing robust pseudonymisation (keyed HMAC, separate key storage, key rotation) rather than attempting anonymisation that will not survive regulatory scrutiny. Reserve anonymisation efforts for specific use cases — external data sharing, research publication, aggregate reporting — where the full validation process is applied.

EU-Native Hosting

Ready to move to EU-sovereign infrastructure?

sota.io is a German-hosted PaaS — no CLOUD Act exposure, no US jurisdiction, full GDPR compliance by design. Deploy your first app in minutes.