GDPR Pseudonymisation vs Anonymisation: What Actually Counts as Personal Data for SaaS Developers — Developer Guide 2026
One of the most common compliance mistakes in SaaS development is treating pseudonymous data as anonymous. Engineers hash email addresses, truncate IP addresses, or replace user IDs with opaque tokens — and then conclude the resulting dataset falls outside GDPR. It does not. GDPR Recital 26 establishes a specific test for what counts as "truly anonymous," and most data that has been through a reversible or brute-forceable transformation fails it.
This guide explains the legal line between pseudonymisation and anonymisation, why the most popular "anonymisation" techniques do not actually work, what proper anonymisation requires in practice, and where pseudonymisation — even though it keeps data in GDPR scope — still earns you meaningful compliance benefits.
The Regulatory Framework: Recital 26 and the Identifiability Test
GDPR Article 2(1) defines the regulation's material scope as applying to "processing of personal data." Article 4(1) defines personal data as "any information relating to an identified or identifiable natural person." The key operative word is identifiable — not merely identified.
Recital 26 provides the test: data is not personal data only where the natural person is "not identifiable" — specifically, where identification is impossible taking into account "all the means reasonably likely to be used, such as singling out, either by the controller or by any other person." The assessment must consider "all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments."
Two immediate consequences for developers:
- The test is dynamic — data that is genuinely anonymous today might become personal data as computing costs drop or new datasets become publicly available.
- The test is contextual — you must consider not just your own re-identification capability but that of any party who might obtain the data.
This is why "we deleted the email address" is not a sufficient anonymisation argument. If there is any other field in the dataset, or any external dataset, that could be combined with yours to re-identify an individual, the Recital 26 test fails.
GDPR Definitions: Pseudonymisation vs Anonymisation
Pseudonymisation (Art.4(5))
GDPR defines pseudonymisation as:
"the processing of personal data in a manner such that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable natural person."
Three elements make data pseudonymous rather than anonymous:
- Reversibility exists: Someone holding the "additional information" (the mapping key, the hash salt, the encryption key) can re-identify.
- Separation requirement: That additional information must be kept separately.
- Organisational controls required: Access controls, separate storage, separate key management.
Pseudonymous data is still personal data and is fully in scope of GDPR. The controller cannot lawfully process it as if it were anonymous.
Anonymisation (Recital 26)
There is no formal definition of anonymisation in the GDPR text — only the Recital 26 test. The practical standard comes from WP29 Opinion 05/2014 on Anonymisation Techniques (endorsed by the EDPB), which requires that anonymised data be:
- Impossible to single out an individual.
- Impossible to link records relating to the same individual.
- Impossible to infer any information about an individual.
All three criteria must be met simultaneously and must remain met under "reasonably likely" re-identification attacks using technology available now and in the near future. Meeting two out of three is not anonymisation.
Why Common "Anonymisation" Techniques Fail
MD5 or SHA-256 Email Hashing
The most frequent mistake. A developer runs user@example.com through MD5 or SHA-256, stores the resulting 32- or 64-character digest, and considers the result anonymous.
Why it fails the Recital 26 test:
- Dictionary attack feasibility: There are approximately 4 billion active email addresses globally. A precomputed hash table of common firstname.lastname@domain.com patterns, all public domain variants, and known breach datasets can cover a substantial fraction of email space in minutes on commodity hardware.
- Known-plaintext: If an attacker knows even one email address in your dataset, they can verify which hash corresponds to it. The hash is then a linking token, not an anonymisation mechanism.
- Email normalization gaps: User@Example.COM, user+tag@example.com, and user@example.com often hash to different values but correspond to the same person — creating inconsistency without providing anonymity.
WP29 Opinion 05/2014 explicitly classified deterministic hashing of low-entropy values (email addresses, phone numbers, national IDs) as pseudonymisation, not anonymisation.
The correct treatment: hashed email addresses are pseudonymous identifiers. They attract full GDPR obligations.
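To make the dictionary-attack point concrete, here is a minimal sketch (the addresses are illustrative, not from any real dataset) of how an attacker reverses unsalted email hashes by enumeration:

```python
import hashlib

def sha256_hex(email: str) -> str:
    return hashlib.sha256(email.lower().encode()).hexdigest()

# The "anonymised" release: bare digests, no key, no salt
published = {sha256_hex("user@example.com"), sha256_hex("jane.doe@example.org")}

# The attacker enumerates candidates from breach corpora and name+domain patterns
candidates = ["user@example.com", "admin@example.com", "jane.doe@example.org"]
lookup = {sha256_hex(c): c for c in candidates}

recovered = sorted(lookup[h] for h in published if h in lookup)
print(recovered)  # ['jane.doe@example.org', 'user@example.com']
```

Every digest whose preimage appears in the candidate list is recovered; the transformation slowed the attacker down, it did not anonymise.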
Truncated IP Addresses
Removing the last octet of an IPv4 address (e.g., 192.168.1.xxx) is a common analytics practice. It fails because:
- IPv4 space is finite: A /24 subnet contains 256 addresses. With typical ISP dynamic address allocation patterns, time-of-day correlation, and known ISP-to-geography mappings, the effective anonymity set is often far smaller than 256.
- IPv6 makes it worse: IPv6 addresses are typically statically assigned per interface. A truncated IPv6 address may uniquely identify a device on a /64 network where only one device exists.
- Log correlation: Your web server logs, CDN logs, and application logs may contain the full IP address. Truncating only one copy while retaining others defeats the purpose.
The CJEU's Breyer v. Germany judgment (C-582/14), echoed by German supervisory authorities such as the BfDI, established that dynamic IP addresses are personal data where the log holder has legal means to combine them with ISP data to identify the subscriber. A truncated IP address in the same dataset is better characterised as pseudonymous.
Replacing User IDs with Random Tokens
Substituting user_id = 7341 with user_id = 8f3a2b1c looks like anonymisation but is simply pseudonymisation with a lookup table. The token is a direct pseudonym — re-identification requires only the mapping table.
This is not a problem to avoid; it is the definition of pseudonymisation. The issue is treating it as anonymisation.
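A minimal sketch of what such token substitution amounts to (class and method names are illustrative): the mapping table is precisely the Art.4(5) "additional information," so it must live in a separate, access-controlled store.

```python
import secrets

class TokenMap:
    """Pseudonymisation by substitution: the mapping table IS the
    'additional information' of Art.4(5) and must be kept separately."""

    def __init__(self):
        self._token_for: dict[int, str] = {}
        self._user_for: dict[str, int] = {}

    def tokenise(self, user_id: int) -> str:
        if user_id not in self._token_for:
            token = secrets.token_hex(8)  # opaque, e.g. '8f3a2b1c...'
            self._token_for[user_id] = token
            self._user_for[token] = user_id
        return self._token_for[user_id]

    def re_identify(self, token: str) -> int:
        # Trivial with the table, which is why this is not anonymisation
        return self._user_for[token]
```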
Removing Obvious Direct Identifiers
A dataset that removes name, email, national ID, and phone number but retains age, zip code, gender, and occupation fails the singling-out test. The landmark study by Latanya Sweeney (2000) demonstrated that 87% of the US population can be uniquely identified using only three fields: date of birth, gender, and five-digit ZIP code. For European data, similar quasi-identifier combinations (birth year + municipality + occupation) provide high re-identification rates.
WP29 called this the "quasi-identifier" problem. Removing obvious identifiers while retaining enough quasi-identifiers fails Recital 26.
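A Sweeney-style uniqueness check is straightforward to script. This sketch (function name and fields are illustrative) reports what fraction of records can be singled out by a given quasi-identifier combination:

```python
from collections import Counter
from typing import Dict, List

def unique_share(records: List[Dict[str, str]], quasi_ids: List[str]) -> float:
    # Fraction of records whose quasi-identifier combination occurs exactly
    # once, i.e. records an attacker can single out
    combos = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    singled_out = sum(1 for r in records
                      if combos[tuple(r[q] for q in quasi_ids)] == 1)
    return singled_out / len(records)

rows = [
    {"birth_year": "1988", "zip": "10115", "gender": "F"},
    {"birth_year": "1988", "zip": "10115", "gender": "F"},
    {"birth_year": "1971", "zip": "80331", "gender": "M"},
]
print(unique_share(rows, ["birth_year", "zip", "gender"]))  # 0.33..., 1 of 3
```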
What Proper Anonymisation Actually Requires
The WP29/EDPB Technique Taxonomy
WP29 Opinion 05/2014 evaluates five main anonymisation approaches against the singling-out / linkability / inference triad:
| Technique | Prevents singling-out | Prevents linkability | Prevents inference | Assessment |
|---|---|---|---|---|
| Noise addition | Partial | Partial | Partial | Alone: insufficient |
| Substitution (pseudonymisation) | No | No | No | Not anonymisation |
| Aggregation / generalisation | Partial | Partial | Partial | Depends on k |
| Data suppression | Partial | No | Partial | Context-dependent |
| Data swapping | Partial | No | Partial | Risky |
No single technique reliably satisfies all three tests simultaneously. Proper anonymisation typically combines multiple techniques.
K-Anonymity
K-anonymity requires that each record in a published dataset is indistinguishable from at least k-1 other records with respect to their quasi-identifier fields. If k = 5, no individual is uniquely singled out — they share characteristics with at least 4 others.
EDPB practical guidance (derived from WP29 and national DPA positions):
- k < 3: Insufficient for most contexts.
- k = 5: Minimum for low-sensitivity analytics data (page view aggregates, feature usage).
- k ≥ 10: Expected for sensitive context analytics (health-adjacent SaaS, employment data, financial services).
- k ≥ 20 or suppression: Required for health data or special categories under Art.9.
K-anonymity alone does not prevent inference attacks. If all k records sharing the same quasi-identifier also share the same sensitive attribute value (e.g., all 5 users in a zip-age cell have the same medical condition), an attacker can infer the sensitive attribute even without identifying the specific individual; this is the so-called homogeneity attack.
L-Diversity
L-diversity extends k-anonymity by requiring that each equivalence class contains at least l distinct values for each sensitive attribute. This prevents attribute inference. A dataset satisfies (k=5, l=3)-anonymity if every group of 5 quasi-identifier-identical records contains at least 3 distinct values for sensitive attributes.
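A compact way to verify both properties on tabular data (a sketch; field names are illustrative): group records by their quasi-identifier tuple, then take the smallest group size (k) and the smallest number of distinct sensitive values per group (l).

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def k_and_l(records: List[Dict[str, str]],
            quasi_ids: List[str],
            sensitive: str) -> Tuple[int, int]:
    groups: Dict[Tuple[str, ...], List[Dict[str, str]]] = defaultdict(list)
    for r in records:
        groups[tuple(r[q] for q in quasi_ids)].append(r)
    k = min(len(g) for g in groups.values())  # worst-case group size
    l = min(len({r[sensitive] for r in g})    # worst-case attribute diversity
            for g in groups.values())
    return k, l

rows = [
    {"age_band": "30-39", "region": "BE", "condition": "X"},
    {"age_band": "30-39", "region": "BE", "condition": "Y"},
    {"age_band": "30-39", "region": "BE", "condition": "X"},
]
print(k_and_l(rows, ["age_band", "region"], "condition"))  # (3, 2): k=3, l=2
```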
Differential Privacy
Differential privacy provides a mathematically rigorous anonymisation guarantee. A mechanism M is ε-differentially private if for any two datasets D and D' differing in exactly one record:
Pr[M(D) ∈ S] ≤ e^ε × Pr[M(D') ∈ S]
The parameter ε (epsilon) is the privacy budget. Smaller ε means stronger privacy but less accuracy.
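To put numbers on it: with ε = 1, e^ε ≈ 2.72, so adding or removing any single individual's record changes the probability of any given output by at most a factor of about 2.7; at ε = 0.1 that factor drops to about 1.11, which is why small budgets cost accuracy.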
Practical thresholds:
- ε ≤ 0.1: Strong privacy, significant accuracy loss. Use for highly sensitive outputs.
- ε = 1.0: Commonly cited as the boundary of "meaningful" differential privacy (NIST SP 800-226).
- ε > 10: Privacy guarantees become marginal. Often insufficient for regulatory purposes.
Differential privacy is used in production by Google (RAPPOR), Apple (telemetry collection), and the US Census Bureau (2020 Census). For SaaS aggregate analytics, the Laplace mechanism or Gaussian mechanism over counts and sums provides the best accuracy-privacy trade-off.
Aggregation with Minimum Count Thresholds
The simplest workable approach for many SaaS analytics use cases: publish only aggregate counts where every cell value represents at least n distinct individuals, suppress cells below the threshold, and add rounding noise.
Common thresholds:
- General analytics: n ≥ 5
- Employment or sensitive analytics: n ≥ 10
- Health data: n ≥ 20 (some DPA guidance goes higher)
The German Datenschutzkonferenz (DSK) has published guidance for health research requiring minimum cell sizes of 3-5 with rounding to the nearest 5. The UK ICO guidance on statistical disclosure control suggests similar thresholds for aggregate data publication.
Pseudonymisation Benefits Under GDPR
Even though pseudonymised data remains personal data and subject to full GDPR obligations, pseudonymisation is explicitly rewarded across multiple articles.
Art.25 — Privacy by Design and by Default
Article 25(1) requires controllers to implement "appropriate technical and organisational measures... such as pseudonymisation" both at the time of design and by default. Implementing robust pseudonymisation (separate key storage, per-tenant key rotation, encryption at rest) satisfies an explicit Privacy by Design obligation and provides evidence of compliance in DPA audits.
Art.32 — Security of Processing
Article 32(1)(a) lists "pseudonymisation and encryption of personal data" as an example of an appropriate technical measure for security. In breach notification assessments under Art.33 and Art.34, whether the breached data was pseudonymised significantly affects whether notification to data subjects is required.
EDPB Guidelines 9/2022 on personal data breach notification state that if the attacker obtained only pseudonymous tokens without the key, the "high risk to the rights and freedoms of natural persons" threshold for Art.34 notification to individuals may not be met — even if Art.33 notification to the supervisory authority is still required.
This can be the difference between a contained incident and a mass notification exercise.
Art.89 — Research, Statistics, Archiving
Article 89(1) requires that processing for research, statistics, or archiving purposes includes "safeguards... such as pseudonymisation." Member states may provide derogations from data subject rights (Art.15-22) for such processing — but those derogations typically require pseudonymisation as a condition.
Truly anonymised data used for research falls entirely outside GDPR (Recital 26). Pseudonymised data can access the Art.89 derogations pathway with appropriate safeguards, making it much easier to conduct longitudinal cohort analysis on SaaS usage data than if you use fully identified data.
Art.6(4)(e) — Compatible Purpose Test
When assessing whether further processing is compatible with the original purpose under Art.6(4), one of the five factors is "the existence of appropriate safeguards, which may include encryption or pseudonymisation." Pseudonymisation makes it easier to justify secondary analytics on data collected for service delivery — a significant practical benefit for product teams running usage analysis.
Art.28 — Processor Agreements
Pseudonymisation in transit and at rest reduces the blast radius of processor incidents. A sub-processor who processes only pseudonymous tokens and has no access to the key mapping cannot re-identify data subjects — limiting both your Art.28 liability exposure and the sub-processor's own GDPR obligations.
Practical Implementation for SaaS Architectures
Tenant Analytics
The most common use case: you want to analyse cross-tenant usage patterns without exposing individual user behaviour.
Wrong approach: Hash user_id, publish the hash alongside feature usage events.
Correct approach (pseudonymised, not anonymous):
```python
import hashlib
import hmac

class TenantPseudonymiser:
    def __init__(self, daily_key: bytes):
        # Key rotated daily — limits linkability across days
        self.key = daily_key

    def pseudonymise(self, user_id: int, tenant_id: str) -> str:
        # Keyed HMAC: tokens can be re-derived (and so linked back to users)
        # only by whoever holds the key
        payload = f"{tenant_id}:{user_id}".encode()
        return hmac.new(self.key, payload, hashlib.sha256).hexdigest()[:16]
```
This is pseudonymisation with key rotation — it limits temporal linkability while keeping records linkable within a day. It does NOT produce anonymous data. Store the daily key in a separate secret store (e.g., AWS Secrets Manager, HashiCorp Vault) with restricted access.
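One possible rotation scheme (an assumption, not mandated by any guidance cited above) derives each daily key from a long-lived master secret held in the secret store, so old keys can be re-derived for a bounded window and only the master secret needs managing:

```python
import hashlib
import hmac
from datetime import datetime, timezone

def derive_daily_key(master_key: bytes, purpose: str = "tenant-analytics") -> bytes:
    # Deterministic per-day derivation keyed on the UTC date: yesterday's key
    # can be re-derived while needed, then rotation happens automatically
    today = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    return hmac.new(master_key, f"{purpose}:{today}".encode(), hashlib.sha256).digest()
```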
Aggregate Analytics (Anonymisation Attempt)
To produce genuinely anonymous aggregate statistics:
```python
import math
import secrets
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AggregateCell:
    dimension_values: Dict[str, str]
    count: int
    is_suppressed: bool
    noise_applied: float


class AnonymisationChecker:
    def __init__(
        self,
        k_min: int = 5,
        sensitivity_level: str = "standard",
        epsilon: float = 1.0,
    ):
        self.k_min = k_min
        self.sensitivity_level = sensitivity_level
        self.epsilon = epsilon  # differential privacy budget
        # Override k_min for sensitive contexts
        if sensitivity_level == "health":
            self.k_min = max(k_min, 20)
        elif sensitivity_level == "employment":
            self.k_min = max(k_min, 10)

    def apply_laplace_noise(self, true_count: int, sensitivity: float = 1.0) -> float:
        # Laplace noise via inverse CDF: X = mu - b * sign(u) * ln(1 - 2|u|)
        scale = sensitivity / self.epsilon
        u = secrets.SystemRandom().uniform(0, 1) - 0.5
        # Clamp to keep the argument of log strictly positive
        u = max(min(u, 0.499999), -0.499999)
        return true_count - scale * math.copysign(1, u) * math.log(1 - 2 * abs(u))

    def process_cell(self, cell: AggregateCell) -> AggregateCell:
        # Suppress cells below the k threshold outright
        if cell.count < self.k_min:
            return AggregateCell(
                dimension_values=cell.dimension_values,
                count=0,
                is_suppressed=True,
                noise_applied=0.0,
            )
        noisy_count = self.apply_laplace_noise(cell.count)
        # Round to nearest 5 for additional disclosure control
        rounded = round(max(noisy_count, self.k_min) / 5) * 5
        return AggregateCell(
            dimension_values=cell.dimension_values,
            count=int(rounded),
            is_suppressed=False,
            noise_applied=noisy_count - cell.count,
        )

    def check_l_diversity(
        self,
        cells: List[AggregateCell],
        sensitive_attribute_key: str,
        l_min: int = 3,
    ) -> List[str]:
        # Placeholder: flags cells carrying a sensitive attribute so that a
        # fuller implementation (or a human) verifies the equivalence class
        # holds at least l_min distinct values for that attribute
        warnings = []
        for cell in cells:
            if not cell.is_suppressed:
                attr_val = cell.dimension_values.get(sensitive_attribute_key)
                if attr_val is not None:
                    warnings.append(
                        f"Cell {cell.dimension_values}: verify l-diversity "
                        f"(l≥{l_min}) for sensitive attribute "
                        f"'{sensitive_attribute_key}'"
                    )
        return warnings

    def assess_anonymisation(self, raw_count: int) -> dict:
        # Note: this checks only the k threshold; linkability and inference
        # still require the noise and l-diversity steps above
        return {
            "is_anonymous": raw_count >= self.k_min,
            "k_threshold": self.k_min,
            "action": "publish_with_noise" if raw_count >= self.k_min else "suppress",
            "epsilon": self.epsilon,
            "sensitivity_level": self.sensitivity_level,
            "regulatory_basis": "WP29 Opinion 05/2014 + EDPB Guidelines",
        }
```
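A usage sketch for the class above (values are illustrative):

```python
checker = AnonymisationChecker(k_min=5, sensitivity_level="employment", epsilon=1.0)

cell = AggregateCell(
    dimension_values={"country": "DE", "plan": "pro"},
    count=12,
    is_suppressed=False,
    noise_applied=0.0,
)
published = checker.process_cell(cell)  # k_min raised to 10; count of 12 passes
print(published.count, published.is_suppressed)

print(checker.assess_anonymisation(raw_count=3))  # below threshold -> "suppress"
```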
IP Address Handling
Rather than truncating IP addresses (which, as shown above, falls short of anonymisation), consider:
- For analytics: Store only the country/region derived from geolocation at collection time. Discard the IP address immediately after lookup. The derived geographic field is not directly re-identifying if sufficiently coarse.
- For security logging: Store full IP addresses as pseudonymised tokens using a rotating daily key. Keep the key separately. Treat security logs as pseudonymous personal data with a defined retention period (typically 30-90 days).
- For billing/fraud detection: Full IP retention with a legal basis (legitimate interest, contract performance) and a documented retention period.
Never: store raw IP addresses in analytics pipelines and call them "anonymised" after truncation.
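A sketch of the first two patterns; `lookup_country` is a hypothetical stand-in for whatever geolocation library you run at collection time (e.g., a MaxMind GeoIP2 reader), not a real API:

```python
import hashlib
import hmac

def lookup_country(ip: str) -> str:
    # Hypothetical stand-in for a real geolocation lookup done at collection time
    return "DE"

def analytics_event(ip: str, event: dict) -> dict:
    # Analytics path: keep only the coarse derived field, never the raw IP
    event["country"] = lookup_country(ip)
    return event

def security_log_token(ip: str, daily_key: bytes) -> str:
    # Security-log path: keyed pseudonym, re-derivable only with the daily key
    return hmac.new(daily_key, ip.encode(), hashlib.sha256).hexdigest()[:16]
```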
Compliance Checklist — 25 Items
Data Classification (5 items)
- CL-01: Every dataset in your system has been classified as: identified personal data, pseudonymous personal data, anonymous data, or non-personal data.
- CL-02: All datasets classified as "anonymous" have been assessed against the Recital 26 re-identification test using the WP29 singling-out / linkability / inference triad.
- CL-03: Hashed email addresses, user IDs with lookup tables, and similar transformed fields are classified as pseudonymous (not anonymous).
- CL-04: Quasi-identifier combinations (age + location + occupation, or similar) have been checked for uniqueness risk using the Sweeney-style analysis.
- CL-05: The classification review has a scheduled frequency (at minimum annually, or when new fields are added to the dataset).
Pseudonymisation Implementation (7 items)
- PS-01: Pseudonymisation uses keyed HMAC (e.g., HMAC-SHA256) or encryption rather than plain hashing.
- PS-02: Pseudonymisation keys are stored separately from the pseudonymised data — different system, different access controls.
- PS-03: Keys are rotated on a defined schedule (daily for analytics tokens, annually for long-term identifiers).
- PS-04: Access to the key mapping is limited to specific roles and logged.
- PS-05: The pseudonymisation scheme is documented in your ROPA (Art.30) entry for the relevant processing activity.
- PS-06: Your data processing agreements (Art.28) specify that processors must not attempt to re-identify pseudonymous data.
- PS-07: Pseudonymised data is treated as personal data throughout — it is subject to data subject rights requests, retention schedules, and breach notification obligations.
Anonymisation Validation (6 items)
- AN-01: All datasets published or shared as "anonymous" have a documented anonymisation method with a named technique (k-anonymity, differential privacy, aggregation with suppression, or combination).
- AN-02: K-anonymity value is documented. K ≥ 5 for standard data; K ≥ 10 for employment/financial; K ≥ 20 for health/Art.9 data.
- AN-03: L-diversity is verified for any dataset where a sensitive attribute is retained alongside quasi-identifiers.
- AN-04: If differential privacy is used, ε is documented and is ≤ 1.0 for data published externally.
- AN-05: Aggregate cells below the minimum count threshold are suppressed, not rounded down to zero (which reveals the cell exists).
- AN-06: The anonymisation approach has been reviewed by a person with data protection expertise (internal DPO, external consultant, or legal review) at least once.
Breach Response (4 items)
- BR-01: Your incident response plan distinguishes between breaches of identified data, pseudonymous data, and anonymous data.
- BR-02: For pseudonymous data breaches, you have a documented process for assessing whether the attacker also obtained the key mapping.
- BR-03: If key mapping was not exposed, your Art.34 notification decision process documents the "low risk" rationale for not notifying data subjects.
- BR-04: Post-breach, pseudonymisation keys are rotated even if no key exposure is confirmed.
Research and Analytics (3 items)
- RA-01: Processing of pseudonymous data for internal analytics has a documented legal basis (typically legitimate interest with Art.6(4)(e) compatibility analysis).
- RA-02: If Art.89 derogations are relied upon for research or statistics, the pseudonymisation safeguards are explicitly documented in the processing record.
- RA-03: Data shared with third parties for analytics is either genuinely anonymous (validated per checklist above) or covered by an Art.28 processor agreement.
Infrastructure Jurisdiction and Pseudonymisation
For SaaS hosted on EU-sovereign infrastructure, pseudonymisation has an additional practical benefit: the key mapping and the pseudonymised data can be held entirely within the EU, and access from outside the EU can be restricted at the infrastructure level.
For SaaS using US-parent cloud providers, pseudonymisation does not resolve CLOUD Act exposure. The CLOUD Act allows US authorities to compel disclosure of data held by US persons or entities regardless of where the data is stored. A pseudonymous dataset held on AWS or Azure remains subject to potential CLOUD Act disclosure — and if the key mapping is also held there, the effective protection is limited. Following the CJEU's Schrems II judgment (Case C-311/18), the EDPB's Recommendations 01/2020 on supplementary measures treat pseudonymisation as an effective safeguard only where the key remains inaccessible to the importing US entity.
EU-native infrastructure where neither the pseudonymised data nor the key mapping is ever processed by a US entity provides the strongest position under GDPR and removes the CLOUD Act overlap entirely.
Summary
| | Identified Data | Pseudonymous Data | Anonymous Data |
|---|---|---|---|
| GDPR applies? | Yes | Yes | No (Recital 26) |
| Data subject rights? | Full (Art.15-22) | Full | None |
| Breach notification? | Art.33 + Art.34 | Art.33 + Art.34 assessment | No obligation |
| Art.89 research derogation? | No (usually) | Yes (with safeguards) | N/A |
| Art.25 Privacy by Design credit? | No | Yes | N/A |
| Art.32 security measure credit? | No | Yes (reduced breach severity) | N/A |
| Typical techniques | — | HMAC-SHA256, encryption, tokenisation | k-anonymity + DP + suppression |
The practical takeaway: pseudonymisation is not a path out of GDPR, but it is a well-recognised tool that reduces breach severity, enables research derogations, and satisfies Privacy by Design obligations. True anonymisation is genuinely difficult, requires combining multiple techniques, and must withstand the Recital 26 re-identification test — including future technology developments.
Most SaaS developers should default to implementing robust pseudonymisation (keyed HMAC, separate key storage, key rotation) rather than attempting anonymisation that will not survive regulatory scrutiny. Reserve anonymisation efforts for specific use cases — external data sharing, research publication, aggregate reporting — where the full validation process is applied.
EU-Native Hosting
Ready to move to EU-sovereign infrastructure?
sota.io is a German-hosted PaaS — no CLOUD Act exposure, no US jurisdiction, full GDPR compliance by design. Deploy your first app in minutes.