2026-05-03 · 14 min read

AWS Entity Resolution EU Alternative 2026: Art.22 Automated Profiling Risk, CLOUD Act Record-Linkage Intelligence, and the Art.5(1)(b) Cross-Dataset Purpose Violation

Post #801 in the sota.io EU Compliance Series

AWS Entity Resolution is the managed record matching and entity linking service that helps organizations disambiguate and unify records about the same real-world entity — typically a customer — across multiple datasets. The value proposition addresses a universal data quality problem: a customer who signed up via mobile app has a different identifier than the same customer in your CRM, which has yet another identifier in your analytics warehouse, and a fourth in your marketing automation platform. Entity Resolution applies configurable matching rules and ML-based matching models to identify which records across these systems represent the same customer and generate a linked entity view.

For organizations building customer data platforms, improving marketing attribution, or consolidating customer records after a merger, the operational value is substantial. Entity Resolution reduces the manual deduplication effort that previously required custom ETL pipelines and domain-specific matching heuristics. It integrates with AWS Glue, Amazon S3, and AWS Clean Rooms to fit within existing data architecture patterns.

It is not in the AWS European Sovereign Cloud service catalog.

That ESC absence carries particular weight for Entity Resolution: a service whose entire purpose is to link records about real people — building unified customer entity graphs by connecting data collected under different contexts and legal bases — sits at the intersection of multiple GDPR obligations that AWS's data quality positioning does not surface. The compliance failures are structural, not configuration-dependent. Entity Resolution's design goal (maximum linkage accuracy across disparate data sources) is in direct tension with GDPR's foundational principles of purpose limitation, data minimization, and privacy by design.


What AWS Entity Resolution Does

AWS Entity Resolution provides automated record matching and entity linking across multiple data sources.

Core components:

  1. Schema mappings that describe the structure and matchable fields of each input dataset
  2. Matching workflows that apply rule-based matching or ML-based matching models across the mapped datasets
  3. Match outputs written to Amazon S3, containing matched record groups with confidence scores
  4. ID namespaces and ID mapping workflows for linking records to external identifier ecosystems, including through AWS Clean Rooms

Scale context: A retail organization with five data systems — e-commerce platform, loyalty program, in-store POS, customer service CRM, and email marketing platform — runs an Entity Resolution matching workflow that compares 2 million records across all five systems. The workflow identifies 1.4 million unique customers (resolving 2 million records to 1.4 million entities), produces a match graph with confidence scores for each identified match, and integrates the entity graph into their CDP for unified customer analytics and personalized marketing.
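The resolve step (N records collapsing to M entities) can be illustrated with a minimal pure-Python sketch. The record shapes and the normalized-email match key below are invented for illustration; Entity Resolution's actual rule-based and ML matching is far richer than a single deterministic key.

```python
import uuid
from collections import defaultdict

# Hypothetical records from three source systems, each with its own ID scheme.
records = [
    {"source": "ecommerce", "id": "EC-1", "email": "Ana.Silva@example.com"},
    {"source": "loyalty",   "id": "LY-9", "email": "ana.silva@example.com"},
    {"source": "crm",       "id": "CR-4", "email": "ana.silva@example.com"},
    {"source": "crm",       "id": "CR-7", "email": "bob@example.com"},
]

def resolve(records):
    """Group records by a normalized match key; each group is one entity."""
    groups = defaultdict(list)
    for r in records:
        groups[r["email"].strip().lower()].append(r)
    # Assign an entity UUID per group: the "resolved entity" identifier
    # that downstream systems would use as a customer join key.
    return {str(uuid.uuid4()): group for group in groups.values()}

entities = resolve(records)
print(len(records), "records resolved to", len(entities), "entities")
# 4 records resolved to 2 entities
```

The same shape scales to the retail scenario above: 2 million input records, 1.4 million resolved entities, with the match key replaced by configurable rules and model scores.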

ESC status: AWS Entity Resolution is not in the AWS European Sovereign Cloud service catalog.


GDPR Issue 1 — Art. 22: Automated Entity Linking Creates Profiling Subject to Automated Decision-Making Obligations

GDPR Art. 22 restricts automated decision-making that produces legal or similarly significant effects on data subjects, and requires human review mechanisms where such automated decisions are made. AWS Entity Resolution's automated record linking creates customer profiles that drive significant decisions — marketing targeting, loyalty tier assignment, fraud scoring, service eligibility determination — while providing none of the Art. 22 safeguards that organizations deploying it are required to implement.

How entity resolution creates Art. 22 exposure: The act of linking records about a data subject across multiple datasets is not itself a decision about that data subject. But the unified entity profile that entity resolution produces — the comprehensive view of a customer assembled by linking their transactions, service interactions, loyalty activity, and communication history — is the input to automated decisions that produce significant effects.

When an Entity Resolution workflow links a customer's e-commerce purchase history (high-value items, frequent returns) with their loyalty program behavior (point redemption patterns), their customer service history (complaint frequency and nature), and their email engagement data (open rates, click patterns), the resulting unified entity profile enables automated segmentation and targeting decisions that would not have been possible without the linkage. The customer who was invisible as a high-risk churn candidate in any single system becomes identifiable as such in the unified entity view — and automated marketing suppression, loyalty tier downgrade, or personalized retention offer decisions follow automatically.

The purpose limitation interaction: The Art. 22 analysis is compounded by the purpose limitation problem (see Issue 2 below). Data collected under separate legal bases — purchase history under Art. 6(1)(b) (contract performance), loyalty data under Art. 6(1)(a) (consent), service interactions under Art. 6(1)(f) (legitimate interest) — is now combined into an entity profile used for automated decisions. The legal basis that authorized each individual collection did not authorize the creation of a unified profile for automated decision-making purposes.

What Art. 22 requires: Where entity resolution outputs drive automated decisions with significant effects on data subjects, organizations must: inform data subjects that their records have been linked and that decisions affecting them are made on the basis of the unified profile, provide the right to obtain human review of those decisions, explain the logic of the automated decision-making (including the entity resolution criteria), and conduct a DPIA under Art. 35 for large-scale automated profiling. AWS Entity Resolution provides none of these mechanisms natively.


GDPR Issue 2 — CLOUD Act: Entity Graphs Are High-Value Business Intelligence

The CLOUD Act enables US law enforcement to compel US cloud providers to produce data regardless of geographic storage location. For AWS Entity Resolution, CLOUD Act compelled disclosure reaches a category of business intelligence that is qualitatively more sensitive than the raw records that were linked: the entity resolution graph that represents your organization's unified understanding of its customer base.

What the entity graph contains that raw records do not: Each individual source system contains a partial view of a customer's relationship with your organization. The entity resolution graph synthesizes these partial views into a complete relational map: which records across every system belong to the same person, the confidence score attached to each link, and the unified behavioral picture — transactions, service interactions, loyalty activity, and communication history — that no single source system holds on its own.

No geographic protection: The entity graphs stored in your AWS S3 buckets in EU regions are reachable by CLOUD Act compelled disclosure because they are held by Amazon Web Services, Inc., a US-headquartered corporation. The sovereign-infrastructure protection available to EU entities operating their own systems does not apply to managed AWS services.


GDPR Issue 3 — Art. 5(1)(b): Cross-Dataset Linkage Is a Purpose Limitation Violation

GDPR Art. 5(1)(b) requires that personal data be collected for specified, explicit, and legitimate purposes and not further processed in a manner incompatible with those purposes. AWS Entity Resolution's core function — linking records collected under different contexts and legal bases into a unified entity view — is a paradigmatic purpose limitation violation.

How records acquire incompatible purposes: A customer's personal data is collected under multiple legal bases across different organizational touchpoints. Each collection event specifies its own processing purpose:

  1. E-commerce purchase history: Art. 6(1)(b), contract performance
  2. Loyalty program enrollment and activity: Art. 6(1)(a), consent
  3. Customer service interactions: Art. 6(1)(f), legitimate interest
  4. Email marketing engagement: Art. 6(1)(a), consent

Each data subject understood, at collection time, that their data would be used for the specified purpose in that context. The in-store customer providing their loyalty card for a purchase did not understand that their transaction record would be linked to their online account, their customer service history, and their email engagement data to create a unified profile.

The compatibility assessment failure: Where further processing is not based on consent or on Union or Member State law, Art. 6(4) permits processing for a purpose other than the one for which the data was collected only if a compatibility assessment under Art. 6(4)(a)-(e) supports it. The factors include: any link between the original and secondary purposes, the context of collection, the nature of the personal data, possible consequences for data subjects, and the existence of appropriate safeguards.

Entity resolution for customer analytics fails this assessment. The link between "process this purchase" and "build a unified customer profile linking all touchpoints for behavioral analytics and automated decision-making" is not apparent from the data subject's perspective. The consequences — automated targeting, tier assignments, and retention decisions based on cross-channel behavioral intelligence — are significant and were not disclosed at collection. The safeguards (consent to unified profiling under Art. 6(1)(a)) are rarely obtained before entity resolution workflows are run.

The cumulative effect problem: Each individual dataset may have been lawfully collected. The purpose limitation violation arises from the aggregation — the act of linking creates a new category of data (unified customer profile) that exceeds the purposes for which any individual dataset was collected. AWS Entity Resolution accelerates and automates this purpose limitation violation at scale.


GDPR Issue 4 — Art. 17: Cascading Erasure Failure Across Linked Entity Records

GDPR Art. 17 gives data subjects the right to erasure of their personal data. For organizations using AWS Entity Resolution, a data subject erasure request creates a multi-system deletion problem that the entity resolution architecture makes structurally harder to complete fully — and the entity graph itself may not be erasable without breaking the analytical infrastructure built on it.

The cascading deletion problem: When a customer submits a GDPR erasure request, the responsible organization must identify and delete all records containing that customer's personal data. For an organization using Entity Resolution, that customer's data exists in:

  1. Each source system (e-commerce, CRM, loyalty, POS, marketing automation)
  2. The S3 source files used as input to Entity Resolution matching workflows
  3. The Entity Resolution match output files in S3 (which contain matched record pairs with confidence scores)
  4. The resolved entity UUID assignments that link all the deleted records to the canonical entity identifier
  5. Downstream systems that consumed the entity graph (CDP, analytics warehouse, personalization platform)
  6. Historical match output files from previous workflow runs

The entity UUID persistence problem: After deleting all source records for a specific customer, the resolved entity UUID that was assigned to that customer may persist in downstream systems as a join key. Analytics queries, ML model training datasets, and BI dashboards that used the entity UUID as a customer identifier may retain that UUID in aggregated or derived form even after the source records are deleted. The UUID itself is not personal data in isolation, but if it can be re-linked to a customer's identity through remaining data, erasure is incomplete.

AWS Entity Resolution provides no deletion propagation: Entity Resolution does not track downstream consumers of entity graph outputs, does not provide a mechanism to identify all S3 locations where a specific entity's resolution output was stored, and does not provide an API to delete all resolution records associated with a specific entity UUID. Organizations must implement their own erasure propagation logic on top of the entity graph infrastructure — a non-trivial engineering task that is rarely completed correctly under time pressure when responding to erasure requests.
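A minimal sketch of what that erasure propagation logic has to do for a single match-output file follows. The CSV layout and the purge_entity helper are assumptions for illustration, not AWS APIs; a real implementation must repeat this across every historical output file and every downstream copy.

```python
import csv
import io

def purge_entity(match_output_csv: str, entity_uuid: str) -> str:
    """Return a copy of a match-output CSV (hypothetical layout) with
    every row belonging to the erased entity's UUID removed."""
    reader = csv.DictReader(io.StringIO(match_output_csv))
    kept = [row for row in reader if row["entity_uuid"] != entity_uuid]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
    writer.writeheader()
    writer.writerows(kept)
    return out.getvalue()

# Hypothetical match output: entity UUID, source system, record ID, score.
sample = (
    "entity_uuid,source,record_id,confidence\n"
    "e-111,crm,CR-4,0.98\n"
    "e-222,loyalty,LY-9,0.91\n"
    "e-111,ecommerce,EC-1,0.97\n"
)
print(purge_entity(sample, "e-111"))
```

Even this simplified version shows why the task is rarely completed correctly: the organization, not the service, must know where every copy of every output lives.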

The match output retention gap: Entity Resolution match outputs are stored in S3 with whatever retention policy the organization configures for that bucket. Unlike source operational data with defined retention schedules, entity graph outputs are often treated as infrastructure artifacts with no defined retention period. They accumulate across matching workflow runs and are rarely subject to the same data lifecycle management as the source records they were derived from.


GDPR Issue 5 — Art. 25: Privacy-by-Design Is Incompatible with Maximum-Linkage Design

GDPR Art. 25 requires data protection by design and by default — that privacy is built into processing activities from the outset. AWS Entity Resolution's design goal is to maximize accurate record linkage across datasets. This design goal is in structural tension with the data minimization, purpose limitation, and storage limitation principles that Art. 25 requires controllers to implement.

What privacy-by-design requires for entity resolution: A privacy-by-design entity resolution implementation would apply the minimum necessary matching attributes to achieve the defined use case purpose, pseudonymize source records before matching (comparing hashed identifiers rather than cleartext personal data where possible), generate entity linkages scoped to specific downstream use cases rather than building a universal customer identity graph, implement automatic entity graph expiry aligned with the retention periods of the source records, and provide audit trails for which data subjects' records were linked and under which legal basis.

What Entity Resolution does instead: Entity Resolution's value is predicated on comprehensive attribute comparison across multiple datasets to maximize match accuracy. Using fewer attributes reduces match recall — records that represent the same entity fail to link because the limited attribute set did not identify the match. The service's ML matching models improve with more attributes and more diverse training signal. Its Clean Rooms integration enables expanding the linkage surface to partner data. Its ID namespace integration enables linking internal customer records to advertising ecosystem identifiers. Each of these capabilities expands the data linkage scope — the opposite of the data minimization that Art. 25 requires.

The universal identity graph anti-pattern: Best practice in entity resolution for GDPR compliance is to build purpose-scoped entity resolution — link records only across the datasets necessary for a specific, defined use case, with a specific legal basis. A purpose-scoped approach links e-commerce and returns records for contract performance purposes, but does not automatically link those records to loyalty program data or advertising IDs.

AWS Entity Resolution's architecture encourages the universal identity graph pattern: configure all data sources, maximize attribute coverage, generate a universal entity UUID that downstream systems use as a customer join key. This universal identity graph pattern is architecturally incompatible with purpose limitation and Art. 25 privacy by design, because it aggregates all known attributes about each customer into a single unified entity without scoping the aggregation to specific, documented purposes with specific legal bases.


EU-Sovereign Alternatives to AWS Entity Resolution

The core functionality of Entity Resolution — matching records about the same entity across datasets — is achievable through open-source tooling deployed on EU infrastructure, providing data residency and sovereignty guarantees that AWS Entity Resolution cannot offer.

Zingg (Self-Hosted Entity Resolution)

Zingg is an open-source entity resolution framework built on Apache Spark. It provides ML-based fuzzy matching across large datasets with support for custom matching rules, configurable similarity thresholds, and iterative model training on labeled match/non-match pairs. Zingg can be deployed on EU-hosted Kubernetes clusters (Hetzner, OVH, Scaleway) or EU-hosted EMR-equivalent Spark clusters.

Privacy-by-design advantages: Zingg's configuration is explicit — you define exactly which fields are compared, which similarity functions are applied, and what confidence threshold constitutes a match. There is no automatic schema discovery or automatic attribute expansion. You can implement pre-matching pseudonymization by hashing PII fields (comparing SHA-256 hashes of email addresses rather than cleartext emails for deterministic matching) before any data enters the matching pipeline. Match output is stored in your own infrastructure with retention policies you define.
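The pre-matching pseudonymization step can be sketched in a few lines of Python before any data enters the matching pipeline. The helper name is an illustration, not part of Zingg, and note that an unsalted hash of an email address is still pseudonymized personal data under GDPR, not anonymized data.

```python
import hashlib

def pseudonymize_email(email: str) -> str:
    """Deterministic pseudonym: normalize, then SHA-256.
    Identical emails still hash to the same value, so exact
    matching works without cleartext entering the pipeline."""
    normalized = email.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

a = pseudonymize_email("Ana.Silva@example.com")
b = pseudonymize_email(" ana.silva@example.com ")
print(a == b)  # True: deterministic matching survives hashing
```

For cross-organization matching, a keyed hash (HMAC with a shared secret) is the safer variant, since plain SHA-256 of emails is vulnerable to dictionary attacks.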

Zingg includes a labeling workflow for supervised learning — you review candidate matches to train the model, which keeps human review in the loop for the ML matching component and supports Art. 22 documentation requirements.

Splink (Python Record Linkage Library)

Splink is an open-source Python library developed by the UK Ministry of Justice for probabilistic record linkage at scale. It implements the Fellegi-Sunter statistical model for record linkage with Expectation-Maximization parameter estimation, provides intuitive model diagnostics, and supports multiple backends including DuckDB (for single-node workloads), Spark (for distributed processing), and SQLite.

Splink runs entirely in your own infrastructure — there is no external service dependency. It produces match probability scores for record pairs and provides explainability features (feature-level contribution to match probability) that support Art. 22 documentation requirements. The library's academic foundation in the Fellegi-Sunter model means its matching logic is auditable, documentable, and reviewable by DPAs if required.
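The Fellegi-Sunter model at Splink's core is compact enough to sketch directly. Each compared field has an m-probability (chance of agreement given a true match) and a u-probability (chance of agreement given a non-match); each field's agreement or disagreement multiplies the prior match odds by a Bayes factor. The parameters and field names below are invented for illustration.

```python
def match_probability(agreements, params, prior=0.001):
    """Fellegi-Sunter: combine per-field Bayes factors with the
    prior match odds to get a posterior match probability.
    agreements: {field: bool}; params: {field: (m, u)}."""
    odds = prior / (1 - prior)
    for field, agrees in agreements.items():
        m, u = params[field]
        # Agreement multiplies odds by m/u; disagreement by (1-m)/(1-u).
        odds *= (m / u) if agrees else ((1 - m) / (1 - u))
    return odds / (1 + odds)

# Illustrative parameters: m = P(agree | match), u = P(agree | non-match).
params = {"email": (0.95, 0.0001), "surname": (0.9, 0.01), "city": (0.9, 0.2)}

p = match_probability({"email": True, "surname": True, "city": False}, params)
print(round(p, 2))  # 0.99
```

Because each field's contribution to the final odds is an explicit factor, the per-field explainability that supports Art. 22 documentation falls out of the model for free.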

Splink's DuckDB backend makes it particularly accessible for organizations that do not have Spark infrastructure: record linkage across datasets of up to ~10 million records is achievable on a single VM with adequate RAM using DuckDB's columnar processing engine.
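The single-node pattern is essentially SQL: block candidate pairs on a cheap key to avoid a full cross join, then compare the remaining fields. A stand-in sketch using the stdlib sqlite3 module (DuckDB accepts the same SQL shape; the table and column names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE crm     (id TEXT, postcode TEXT, surname TEXT);
CREATE TABLE loyalty (id TEXT, postcode TEXT, surname TEXT);
INSERT INTO crm     VALUES ('CR-1', '10115', 'silva'), ('CR-2', '80331', 'meyer');
INSERT INTO loyalty VALUES ('LY-7', '10115', 'silva'), ('LY-8', '10115', 'braun');
""")

# Blocking: only rows sharing a postcode become candidate pairs,
# avoiding the full N x M cross join; surname is then compared.
pairs = conn.execute("""
    SELECT c.id, l.id
    FROM crm c JOIN loyalty l ON c.postcode = l.postcode
    WHERE c.surname = l.surname
""").fetchall()
print(pairs)  # [('CR-1', 'LY-7')]
```

Splink generates much more sophisticated SQL (multiple blocking rules, probabilistic comparisons), but the execution model on DuckDB is the same: set-based SQL over columnar data on one machine.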

Apache Spark Record Linkage (Custom Pipeline)

For organizations with existing Spark infrastructure, implementing entity resolution as a custom Spark pipeline provides maximum control over the matching logic, the data minimization implementation, and the output storage model. The GraphFrames library provides connected component analysis for resolving entity clusters from match pairs. Blocking techniques (MinHash LSH for approximate nearest-neighbor matching, sorted neighborhood blocking for address-based matching) make large-scale comparison computationally tractable.
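The connected-component step can also be sketched without GraphFrames: a small union-find over the pairwise matches emitted by the comparison stage yields the entity clusters. The record IDs below are invented.

```python
def entity_clusters(pairs):
    """Resolve entity clusters from pairwise matches via union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)  # union the two components

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return sorted(map(sorted, clusters.values()))

# Pairwise matches from the comparison stage (hypothetical record IDs):
matches = [("EC-1", "LY-9"), ("LY-9", "CR-4"), ("CR-7", "POS-2")]
print(entity_clusters(matches))
# [['CR-4', 'EC-1', 'LY-9'], ['CR-7', 'POS-2']]
```

GraphFrames' connectedComponents performs the same resolution distributed across a Spark cluster; the output cluster IDs play the role of the resolved entity identifiers.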

A custom Spark implementation allows full implementation of privacy-by-design patterns: pre-match pseudonymization, purpose-scoped match graph outputs (separate match graphs for each use case rather than a universal entity UUID), integration with your existing data lifecycle management for retention and erasure, and audit logging at the record-pair level for Art. 30 ROPA documentation.


Transition Approach

Moving from AWS Entity Resolution to EU-sovereign entity resolution requires addressing both the technical migration and the GDPR compliance architecture that should have been in place from the start.

Phase 1 — Purpose Scoping Audit (1-2 weeks): Before migrating tooling, document each entity resolution use case with its own legal basis, purpose, and data minimization scope. Identify which datasets should be linked for which purposes. Determine which existing entity resolution workflows produce universal identity graphs that must be decomposed into purpose-scoped alternatives.

Phase 2 — Tool Deployment + Baseline Migration (2-3 weeks): Deploy Zingg or Splink on EU infrastructure. Migrate the highest-priority matching workflow first, implementing pre-match pseudonymization and purpose-scoped output storage. Validate match quality against the Entity Resolution baseline to confirm no significant recall degradation.

Phase 3 — Erasure Integration + Full Migration (3-4 weeks): Implement erasure propagation logic for the entity graph outputs (delete all match records associated with a specific entity UUID when a GDPR erasure request is processed). Migrate remaining workflows progressively. Implement entity graph retention schedules aligned with source record retention.

The migration is technically straightforward; the compliance architecture work (purpose scoping, erasure integration, DPIA documentation) is the primary investment.


Conclusion

AWS Entity Resolution addresses a real data quality problem — customer record fragmentation across systems is one of the most common obstacles to accurate analytics and personalized customer experiences. The Art.22, CLOUD Act, Art.5(1)(b), Art.17, and Art.25 compliance failures documented in this guide are not arguments against entity resolution as a technique. They are arguments against building your customer identity infrastructure on a US-jurisdiction managed service that is designed to maximize data linkage and sits outside the AWS European Sovereign Cloud catalog.

The open-source alternatives — Zingg for Spark-based large-scale matching, Splink for Python-native probabilistic linkage — deliver comparable matching capabilities deployed on EU infrastructure under your full operational control. The compliance architecture work required for privacy-by-design entity resolution (purpose scoping, pre-match pseudonymization, erasure integration) is required regardless of which tooling you choose. Building that architecture on EU-sovereign infrastructure ensures that the resulting customer intelligence never comes under US-jurisdiction control.


Part of the sota.io EU Compliance Series — 801 posts analyzing GDPR compliance gaps in cloud services used by European developers and organizations. Explore the full series or deploy on sota.io — EU-native infrastructure with no US-parent jurisdiction.

EU-Native Hosting

Ready to move to EU-sovereign infrastructure?

sota.io is a German-hosted PaaS — no CLOUD Act exposure, no US jurisdiction, full GDPR compliance by design. Deploy your first app in minutes.