AWS Macie EU Alternative 2026: S3 Data Classification, PII Detection, and GDPR Under the CLOUD Act
Post #743 in the sota.io EU Compliance Series
AWS Macie is Amazon's fully managed data security service that uses machine learning to automatically discover, classify, and protect sensitive data stored in Amazon S3. Security teams use it to audit data lakes for GDPR-regulated personal information. Healthcare organizations deploy it to identify PHI scattered across storage buckets. Financial institutions run it against transaction archives to find credit card numbers, IBAN codes, and account identifiers. Compliance officers rely on it to generate findings reports that feed into Art.30 Records of Processing Activities.
The fundamental tension: Macie is designed to identify exactly where your most sensitive data lives — and then it reports those findings to infrastructure that remains under US jurisdiction. Understanding how this creates GDPR exposure requires looking at what Macie actually does with the data it discovers, not just where your S3 buckets are located.
What AWS Macie Actually Does
When you enable Macie on an S3 bucket, the service continuously samples and analyzes objects using a combination of managed data identifiers and custom detection rules. Macie classifies objects by sensitivity level, generates findings for objects that match sensitive data patterns, and stores those findings in a managed findings repository accessible through the AWS console, API, and Amazon EventBridge.
The managed data identifiers cover over 100 sensitive data types across multiple categories: financial information (credit card numbers, bank account numbers, SWIFT codes, IBAN codes), personally identifiable information (passport numbers, national IDs, tax identifiers), protected health information (ICD-10 codes, NPI numbers, DEA registration numbers), and credentials (AWS access keys, private keys, authentication tokens).
For European organizations, the relevant managed identifiers include EU-specific formats: German Personalausweis numbers, French NIR (national identification), Dutch BSN numbers, Italian codice fiscale, Spanish DNI, and dozens of other national ID formats. Macie can detect these patterns inside S3 objects including CSV files, JSON documents, Parquet datasets, Office documents, and PDF files.
Findings include the bucket name, object key, the specific sensitive data types detected, the count of occurrences, and sample values — partial representations of the actual detected data. These findings are stored and queryable. Amazon EventBridge can route findings to SIEM systems, Lambda functions, or Security Hub for automated response workflows.
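As a rough illustration of what a downstream consumer receives, the sketch below reduces a finding delivered through EventBridge to the handful of fields a SIEM rule would typically need. The field names (`resourcesAffected`, `classificationDetails`, `sensitiveData`) follow the Macie finding schema as we understand it and should be checked against a real event. `summarize_finding`, the sample event, and the detection type labels are hypothetical.

```python
from collections import Counter


def summarize_finding(event: dict) -> dict:
    """Reduce an EventBridge Macie finding event to bucket, key, severity, counts."""
    detail = event.get("detail", {})
    resources = detail.get("resourcesAffected", {})
    sensitive = (
        detail.get("classificationDetails", {})
        .get("result", {})
        .get("sensitiveData", [])
    )
    # Sum occurrences per detected data type across all categories.
    counts = Counter()
    for category in sensitive:
        for detection in category.get("detections", []):
            counts[detection.get("type", "UNKNOWN")] += detection.get("count", 0)
    return {
        "bucket": resources.get("s3Bucket", {}).get("name"),
        "key": resources.get("s3Object", {}).get("key"),
        "severity": detail.get("severity", {}).get("description"),
        "detections": dict(counts),
    }


# Hand-written sample event (assumed shape, placeholder type labels).
sample_event = {
    "source": "aws.macie",
    "detail": {
        "severity": {"description": "High"},
        "resourcesAffected": {
            "s3Bucket": {"name": "company-data"},
            "s3Object": {"key": "customers/customer-12345.csv"},
        },
        "classificationDetails": {
            "result": {
                "sensitiveData": [
                    {"category": "FINANCIAL_INFORMATION",
                     "detections": [{"type": "IBAN_PLACEHOLDER", "count": 47}]},
                    {"category": "PERSONAL_INFORMATION",
                     "detections": [{"type": "NATIONAL_ID_PLACEHOLDER", "count": 3}]},
                ]
            }
        },
    },
}

summary = summarize_finding(sample_event)
print(summary)
```

Even this reduced summary names the bucket, the object, and the data types it holds, which is why the findings themselves deserve the same protection as the data they describe.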
The GDPR Problem: Finding PII Creates New PII
Here is the first structural GDPR tension with Macie: the act of discovering sensitive data creates a new dataset — the findings themselves — that contains information about where personal data is stored, what types are present, and how many occurrences exist.
Under the GDPR Art.5(2) accountability principle, organizations must be able to demonstrate what personal data they process and where. Macie findings are meant to support this. But findings stored in AWS infrastructure introduce their own Art.30 processing requirement: you are now processing data about your personal data processing activities, and that meta-dataset is subject to GDPR in its own right.
The findings repository is managed by AWS and accessible to AWS under US law. A CLOUD Act request targeting your AWS account can compel disclosure of Macie findings — which amounts to a map of where every piece of GDPR-regulated personal information in your S3 estate is located, organized by data type and bucket.
For an attacker or regulator, access to Macie findings is arguably more valuable than access to the underlying S3 data, because findings function as an index: they identify the buckets, objects, and data types without requiring processing of the raw data.
GDPR Exposure Point 1: Art.9 Special Category Data — The Surveillance Paradox
GDPR Art.9 prohibits processing special categories of personal data (health data, biometric data, racial or ethnic origin, political opinions, religious beliefs, trade union membership, genetic data, sex life or sexual orientation data) unless one of the exceptions listed in Art.9(2) applies. The prohibition applies not just to storing the underlying data but to any processing of it, including automated scanning.
AWS Macie's managed data identifiers include specific detection patterns for health-related information: ICD-10 diagnostic codes, drug names from the WHO anatomical therapeutic chemical classification, NPI numbers (National Provider Identifiers for US healthcare), DEA registration numbers, and insurance policy identifiers. Macie also detects financial information that, in combination with other data, can reveal special category attributes — for example, religious donation amounts to identified organizations, or union membership fees to identified trade unions.
The paradox: You may deploy Macie specifically to find Art.9 data that was incorrectly stored without appropriate safeguards. In doing so, you run Art.9 data through AWS's ML classification infrastructure, and the findings documenting what Art.9 data exists (and where) flow through US-jurisdiction systems.
The legal basis question becomes recursive: what legal basis covers the Macie processing of Art.9 data? Legitimate interest (Art.6(1)(f)) alone cannot justify Art.9 processing, because an Art.9(2) exception is required in addition to an Art.6 basis. One candidate exception is Art.9(2)(j), which covers processing necessary for archiving purposes in the public interest, or for scientific or historical research, but it does not fit routine commercial data discovery workflows.
Art.9(2)(b) covers processing necessary for carrying out obligations and exercising rights in employment, social security, and social protection law, with appropriate safeguards. Art.9(2)(f) covers processing necessary for establishment, exercise, or defense of legal claims. Neither maps cleanly to "we are scanning our S3 buckets to find health data we stored incorrectly."
For organizations processing health data, financial data with health implications, or any special category data, Macie represents an additional Art.9 processing activity that requires its own legal basis documentation.
GDPR Exposure Point 2: Art.17 Right to Erasure — The Findings Outlast the Data
GDPR Art.17 grants data subjects the right to erasure of personal data without undue delay when the data is no longer necessary for the purpose it was collected, when consent is withdrawn, or when the subject objects under Art.21.
An organization responding to an erasure request identifies the relevant S3 objects, deletes them from the bucket, and considers the obligation fulfilled. But Macie findings persist independently of the underlying objects. A finding documenting that object s3://company-data/customers/customer-12345.csv contained 47 occurrences of IBAN codes and 3 occurrences of national ID numbers continues to exist in the Macie findings repository after the underlying CSV is deleted.
This creates two problems:
Problem A (findings as residual personal data): The finding contains the partial sample values Macie extracted during classification. These partial values may themselves constitute personal data: a truncated IBAN code, a partial national ID number, a fragment of a health record identifier. Under Art.17, the subject has the right to erasure of this residual data too, but Macie findings are not removed when the S3 object is deleted. They live in a separate repository with its own retention period, and any copies exported to S3, Security Hub, or a SIEM have to be located and purged separately.
Problem B (S3 versioning incompleteness): Many S3 configurations use versioning to protect against accidental deletion. When an object is "deleted" in a versioned bucket, a delete marker is added but all previous versions are retained. If an erasure request is fulfilled only by adding a delete marker, the personal data remains in those previous versions. Macie findings may continue to reference objects that appear deleted from a bucket-listing perspective but remain accessible to anyone with appropriate S3 API access.
Complete Art.17 compliance for S3 data requires: deleting all versions of relevant objects (or object versions containing the relevant data), purging Macie findings that reference those objects, and verifying that Macie is not re-classifying data that exists in replication targets. For S3 Cross-Region Replication configurations, Macie in the source region may not have visibility into the replication target, and a separate Macie configuration may be required in each target region — each with its own findings repository.
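The first of these requirements, deleting every version rather than adding a delete marker, can be sketched as follows. This is a minimal example, assuming a boto3-style S3 client is passed in; `erase_all_versions` is a hypothetical helper, and it does not handle Object Lock, MFA delete, or replication targets, each of which needs its own pass.

```python
def erase_all_versions(s3_client, bucket: str, key: str) -> int:
    """Permanently delete every version and delete marker of one object key.

    Returns the number of versions and delete markers removed.
    """
    deleted = 0
    paginator = s3_client.get_paginator("list_object_versions")
    # Prefix matching can also return sibling keys, so filter on the exact key.
    for page in paginator.paginate(Bucket=bucket, Prefix=key):
        entries = page.get("Versions", []) + page.get("DeleteMarkers", [])
        for entry in entries:
            if entry["Key"] != key:
                continue
            # Deleting with an explicit VersionId removes that version for good.
            s3_client.delete_object(
                Bucket=bucket, Key=key, VersionId=entry["VersionId"]
            )
            deleted += 1
    return deleted
```

With boto3 installed, a call would look like `erase_all_versions(boto3.client("s3"), "company-data", "customers/customer-12345.csv")`. Logging the returned count alongside the erasure request gives you evidence for the Art.5(2) accountability record.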
GDPR Exposure Point 3: Art.5(1)(c) Data Minimisation — Macie Is Reactive, Not Preventive
GDPR Art.5(1)(c) requires that personal data be adequate, relevant, and limited to what is necessary in relation to the purposes for which it is processed — the data minimisation principle.
AWS Macie's architecture is fundamentally reactive: it discovers sensitive data after it has already been uploaded to S3. Macie does not prevent sensitive data from entering storage; it identifies where sensitive data has already accumulated.
This reactive architecture is appropriate for auditing legacy data estates and discovering data that should not have been stored. But it creates a compliance gap: between the time sensitive data is uploaded and the time Macie discovers and generates a finding, the data exists in S3 without documented data minimisation controls.
For high-volume data ingestion pipelines — streaming customer event data, log aggregation, batch ETL pipelines — there may be hours or days between upload and Macie classification. During this window, personal data exceeding minimisation requirements is stored without the compensating controls that Art.5(1)(c) requires.
More significantly, Macie's findings-based workflow puts the burden on humans to review findings and take remediation action. The human review latency between finding generation and actual data deletion or reclassification can be weeks, particularly in large organizations with complex approval workflows. Art.5(1)(c) does not permit indefinite retention of excess personal data while a findings queue awaits human action.
A Macie deployment that generates findings but lacks automated remediation workflows — automatic deletion of incorrectly stored sensitive data, automatic reclassification and access restriction, automatic notification to the data owner for review — does not satisfy Art.5(1)(c) at the process level, even if the technology is correctly configured.
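One way to make that remediation step explicit is a small policy table that maps each sensitivity level to an action and a deadline. The sketch below is illustrative only: the `Finding` class, the action names, and the deadlines are hypothetical choices, not values prescribed by GDPR or by Macie.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Finding:
    bucket: str
    key: str
    sensitivity_level: str  # e.g. "art9-special-category", "financial", "personal"
    detected_at: datetime


# Hypothetical policy: action to take and the maximum delay allowed before it happens.
REMEDIATION_POLICY = {
    "art9-special-category": ("quarantine", timedelta(hours=1)),
    "financial": ("restrict-access", timedelta(hours=24)),
    "personal": ("notify-owner", timedelta(days=7)),
}


def plan_remediation(finding: Finding, now: datetime) -> dict:
    """Return the action for a finding and whether its deadline has passed."""
    if finding.sensitivity_level not in REMEDIATION_POLICY:
        return {"action": "none", "due": None, "overdue": False}
    action, max_delay = REMEDIATION_POLICY[finding.sensitivity_level]
    due = finding.detected_at + max_delay
    return {"action": action, "due": due, "overdue": now > due}


detected = datetime(2026, 1, 10, 8, 0, tzinfo=timezone.utc)
finding = Finding(
    "company-data", "customers/customer-12345.csv", "art9-special-category", detected
)
plan = plan_remediation(finding, now=detected + timedelta(hours=2))
print(plan["action"], plan["overdue"])
```

Tracking the `overdue` flag per finding turns review latency into something measurable, so a backlog shows up as a metric rather than as a surprise during an audit.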
GDPR Exposure Point 4: CLOUD Act — Findings as an Intelligence Map
The US CLOUD Act (Clarifying Lawful Overseas Use of Data Act, 2018) enables US law enforcement to obtain data stored by US-based cloud providers under warrant or court order, regardless of where the data is physically stored. Amazon Web Services is a US-based provider. Macie findings stored in AWS infrastructure are accessible under CLOUD Act orders.
The intelligence value of Macie findings under a CLOUD Act request is significant. A findings export for a large organization's S3 estate functions as a structured inventory: which buckets contain what categories of sensitive data, how many occurrences, which objects are highest risk, and what data types are present. This inventory answers questions that would otherwise require extensive forensic work or production of raw data at scale.
For organizations subject to GDPR Art.44-49 (restrictions on transfers to third countries), the CLOUD Act creates a structural tension. The EU-US Data Privacy Framework (DPF) provides a mechanism for transfers to certified US organizations, but the DPF does not override CLOUD Act authority: US law enforcement can still compel production of DPF-covered data under the CLOUD Act. The Schrems II concerns that invalidated Privacy Shield (CJEU C-311/18) were not fully resolved by the DPF's adoption, and a future legal challenge could invalidate the DPF as well.
For organizations processing data for EU public authorities, defense contractors, critical infrastructure operators, or others with heightened sovereignty requirements, the CLOUD Act exposure of Macie findings — a map of sensitive data locations — represents a meaningful intelligence risk independent of access to the underlying data.
GDPR Exposure Point 5: Art.30 Records of Processing — Circular Documentation
GDPR Art.30 requires controllers and processors to maintain records of processing activities. Macie is marketed as a tool that supports Art.30 compliance by inventorying what personal data exists and where. But Macie findings create their own Art.30 entry: the Macie service itself is processing personal data (or at minimum, data about personal data processing) and requires documentation.
The Art.30 entry for Macie must include:
- Purposes of processing: data security, GDPR compliance auditing, sensitive data discovery
- Categories of data subjects and personal data: the sensitive data types that Macie classifies (financial data, health data, national IDs, credentials)
- Recipients: AWS (as processor), EventBridge targets, Security Hub, any SIEM systems receiving findings
- Third country transfers: findings are processed by AWS infrastructure under US jurisdiction; this constitutes a transfer requiring documentation and legal basis under Art.44-49
- Retention periods: Macie retains findings for 90 days by default; export to S3 for longer retention creates additional entries
- Technical and organizational security measures
Many organizations deploying Macie to improve their Art.30 compliance documentation fail to add Macie itself to their Art.30 records — creating a gap in the documentation that Macie is supposed to help close.
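The Art.30 entry listed above can be kept as structured data so that gaps are easy to spot. The sketch below is one hypothetical way to model it. The `ProcessingRecord` class and its field names are illustrative, and the values are taken from the list above rather than from any official template.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class ProcessingRecord:
    activity: str
    purposes: list
    data_categories: list
    recipients: list
    third_country_transfers: str
    retention: str
    security_measures: list = field(default_factory=list)

    def missing_fields(self) -> list:
        """Names of fields that are still empty, as a simple completeness check."""
        return [name for name, value in asdict(self).items() if not value]


macie_record = ProcessingRecord(
    activity="AWS Macie sensitive data discovery",
    purposes=["data security", "GDPR compliance auditing", "sensitive data discovery"],
    data_categories=["financial data", "health data", "national IDs", "credentials"],
    recipients=["AWS (processor)", "EventBridge targets", "Security Hub", "SIEM"],
    third_country_transfers="Findings processed by AWS under US jurisdiction (Art.44-49)",
    retention="90 days in Macie by default; longer if exported to S3",
    # security_measures intentionally left empty to show the check firing
)

print(macie_record.missing_fields())
```

Running the same check across every record in the register is a cheap way to catch entries, such as Macie itself, that were never fully documented.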
GDPR Exposure Point 6: Art.25 Privacy by Design — Scanning as the Wrong Abstraction Layer
GDPR Art.25 requires data protection by design and by default. Privacy-by-design principles call for embedding data protection into system architecture rather than adding it as an overlay. The Art.25(1) standard requires appropriate technical and organizational measures to implement data protection principles effectively, both at the time the means of processing are determined and at the time of the processing itself.
Macie operates at the storage layer after data has been collected, transmitted, stored, and potentially processed. It is a classification and discovery tool, not a prevention mechanism. From an Art.25 perspective, the appropriate abstraction layer for data minimisation controls is upstream of storage — at ingestion, collection, or processing points where personal data can be evaluated before it is written to persistent storage.
A pure Macie deployment — scan what already exists in S3, generate findings, remediate after discovery — represents a detective control rather than a preventive one. Art.25's "data protection by design" standard implies that appropriate controls prevent excess personal data from being stored, rather than discovering it after the fact.
This does not mean Macie is incompatible with GDPR. Combined with preventive controls (input validation at ingestion, tokenization at processing, classification enforcement at upload), Macie provides meaningful defense-in-depth. But deploying Macie as a primary Art.25 compliance mechanism — rather than as an audit and discovery layer supplementing upstream controls — does not meet the by-design standard.
EU-Native Alternatives for Data Classification and PII Detection
Building a GDPR-compliant data discovery and classification stack from EU-native or open-source components eliminates the CLOUD Act exposure and provides architectural control over findings data.
Apache Ranger
Apache Ranger provides centralized security administration for Hadoop ecosystem data stores (HDFS, Hive, HBase) and increasingly for cloud object storage. Ranger's classification engine can tag data at the column, table, or file level with sensitivity labels, and policy enforcement prevents access to classified data without appropriate authorization.
For S3-compatible object storage, Ranger integrates with MinIO (which implements the S3 API) to provide attribute-based access control policies. A Ranger policy can enforce that any object tagged as containing special category data under Art.9 requires explicit data steward approval before access. Classification metadata stays within your infrastructure — no external findings service required.
Ranger's approach inverts the Macie model: rather than discovering what data exists, Ranger enforces policies on data access and requires classification at write time. This aligns more closely with the Art.25 privacy-by-design standard.
Open Policy Agent (OPA)
Open Policy Agent is a general-purpose policy engine that can enforce authorization decisions across APIs, object storage, Kubernetes, and microservices. For data classification, OPA can evaluate incoming data requests against classification policies and enforce data minimisation rules at the point of ingestion.
A pattern that addresses the Macie reactive-discovery gap: use OPA as an ingestion gateway that evaluates data before it reaches object storage. OPA policies can require that all objects include classification metadata at write time, reject objects containing detected PII patterns without explicit authorization, and enforce retention limits based on data classification.
OPA policies are defined in Rego (Open Policy Agent's native language) and can be tested and versioned independently of the systems they protect. OPA runs on-premises, in your EU cloud environment, or in Kubernetes. No data leaves your infrastructure for classification or policy evaluation.
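To make the ingestion-gateway pattern concrete, the sketch below expresses the same three checks in plain Python. It is a stand-in for the decision logic only, not Rego and not OPA itself. The metadata keys (`classification`, `retention_days`, `pii_authorized`), the allowed labels, and the two regular expressions are assumptions chosen for illustration.

```python
import re

# Deliberately simple example patterns; a real gateway would call a proper detector.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "iban": re.compile(r"\b[A-Z]{2}[0-9]{2}[A-Z0-9]{11,30}\b"),
}
ALLOWED_CLASSIFICATIONS = {"public", "internal", "personal", "art9-special-category"}
NON_PERSONAL = {"public", "internal"}


def evaluate_upload(metadata: dict, body: str) -> dict:
    """Decide whether an object may be written, and explain any rejection."""
    reasons = []
    classification = metadata.get("classification")
    if classification not in ALLOWED_CLASSIFICATIONS:
        reasons.append("missing or unknown classification metadata")
    if "retention_days" not in metadata:
        reasons.append("missing retention_days")
    detected = sorted(name for name, pat in PII_PATTERNS.items() if pat.search(body))
    # PII in an object labelled as non-personal needs explicit authorization.
    if detected and classification in NON_PERSONAL and not metadata.get("pii_authorized"):
        reasons.append("PII detected in non-personal object: " + ", ".join(detected))
    return {"allow": not reasons, "reasons": reasons}


decision = evaluate_upload(
    {"classification": "public", "retention_days": 30},
    "contact: anna@example.org",
)
print(decision)
```

In an OPA deployment the gateway would send the metadata and detection results as the policy input and act on the returned decision. Keeping the rules in a policy engine means they can be reviewed and versioned separately from the upload service.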
MinIO with Information Lifecycle Management
MinIO is an S3-compatible object storage system designed for high-performance workloads. MinIO ILM (Information Lifecycle Management) provides policy-based object lifecycle management: automatic deletion of objects after configurable retention periods, transition to different storage tiers, and object locking for compliance use cases.
For GDPR Art.17 erasure, MinIO's approach to object lifecycle is more tractable than S3 versioning: MinIO ILM can enforce hard deletion deadlines, making it straightforward to implement retention schedules that satisfy Art.5(1)(e) storage limitation requirements. Because MinIO runs in your infrastructure, there is no separate findings service — lifecycle rules and audit logs remain under your control.
MinIO's WORM (Write Once Read Many) object locking provides tamper-evident storage for records that must be retained (for example, legal holds or regulatory archives), while separately enforcing deletion deadlines for data subject to erasure rights.
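A retention schedule of this kind can be written as a standard S3 lifecycle configuration. The sketch below builds one as a plain dictionary in the shape boto3's `put_bucket_lifecycle_configuration` expects. Whether a given MinIO release accepts every element, in particular noncurrent-version expiration, is an assumption to verify against your deployment. `build_lifecycle_rules` and the prefix are hypothetical.

```python
def build_lifecycle_rules(prefix: str, retention_days: int, noncurrent_days: int = 1) -> dict:
    """Build an S3-style lifecycle configuration enforcing a hard deletion deadline."""
    if retention_days < 1 or noncurrent_days < 1:
        raise ValueError("retention periods must be at least one day")
    rule_name = prefix.strip("/").replace("/", "-") or "all"
    return {
        "Rules": [
            {
                "ID": f"expire-{rule_name}",
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                # Current versions expire after the retention period.
                "Expiration": {"Days": retention_days},
                # Older versions are purged too, so erasure is not only a delete marker.
                "NoncurrentVersionExpiration": {"NoncurrentDays": noncurrent_days},
            }
        ]
    }


rules = build_lifecycle_rules("customers/", retention_days=30)
print(rules["Rules"][0]["ID"])
```

The resulting dictionary can be passed as `LifecycleConfiguration` to an S3 client pointed at the MinIO endpoint. Keeping the rules in code makes the retention schedule reviewable alongside the Art.30 record that cites it.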
Presidio (Microsoft open-source, self-hosted)
Microsoft Presidio is an open-source data anonymization SDK (GitHub: microsoft/presidio) that provides PII detection and anonymization for text data. Presidio supports 50+ PII entity types using a combination of rule-based recognizers and ML models (spaCy, Transformers).
Deployed self-hosted in an EU environment, Presidio provides Macie-equivalent PII detection without sending findings or sample data to external infrastructure. Presidio supports anonymization operations directly — replacing detected PII with tokens, pseudonyms, or redacted markers — which enables a more complete Art.5(1)(c) minimisation workflow than Macie's findings-only approach.
Presidio integrates with Python data pipelines, Spark workloads, and batch processing frameworks. For S3-equivalent object storage workflows, Presidio can be deployed as a processing step in your ingestion pipeline, classifying and transforming data before it reaches persistent storage.
Ceph with Object Gateway Classification Hooks
Ceph is an open-source distributed storage system with an S3-compatible object gateway (Ceph RGW). Ceph supports lifecycle policies, bucket notifications, and gateway hooks that can trigger classification workflows at upload time.
By deploying Ceph RGW with a classification service that evaluates objects before acknowledgment, organizations can implement preventive PII detection at the storage layer — rejecting or transforming sensitive objects before they are committed to storage. This preventive approach satisfies Art.25 privacy-by-design requirements more directly than a post-hoc scanning model.
Ceph runs on commodity hardware or in EU cloud environments. The Ceph community maintains packages for all major Linux distributions, and managed Ceph deployments are available from EU-based storage providers.
Migration Architecture: From Macie to Self-Hosted Data Classification
Phase 1: Export and Preserve Existing Findings
Before disabling Macie, export all existing findings to a self-managed location. Use the Macie API to export findings in JSONL format, preserving the classification history of your S3 estate. This export becomes your baseline for the migration period and satisfies Art.30 documentation requirements for the period during which Macie was active.
Create an inventory of all S3 buckets currently covered by Macie, noting which buckets have findings and what data types are represented. This inventory drives prioritization for the next phase.
import json
from datetime import datetime, timezone

import boto3

macie = boto3.client('macie2')
s3 = boto3.client('s3')


def export_findings_to_s3(export_bucket: str, prefix: str) -> int:
    """Export all current Macie findings to one JSONL object; return the count."""
    # Collect every finding ID; list_findings is paginated.
    paginator = macie.get_paginator('list_findings')
    finding_ids = []
    for page in paginator.paginate():
        finding_ids.extend(page.get('findingIds', []))

    # Fetch in batches of 50 (Macie API limit)
    all_findings = []
    for i in range(0, len(finding_ids), 50):
        batch = finding_ids[i:i + 50]
        response = macie.get_findings(findingIds=batch)
        all_findings.extend(response.get('findings', []))

    # Findings contain datetime values (e.g. createdAt) that json.dumps cannot
    # serialize on its own; default=str writes them out as strings.
    export_date = datetime.now(timezone.utc).date()
    export_key = f"{prefix}/macie-findings-export-{export_date}.jsonl"
    s3.put_object(
        Bucket=export_bucket,
        Key=export_key,
        Body='\n'.join(json.dumps(f, default=str) for f in all_findings).encode(),
    )
    return len(all_findings)
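The bucket inventory described in this phase can be derived from the exported findings. The sketch below groups findings by bucket and collects the detected data types. It assumes each finding carries `resourcesAffected` and `classificationDetails` in the shape Macie documents, and `build_bucket_inventory` and the sample records are hypothetical.

```python
from collections import defaultdict


def build_bucket_inventory(findings: list) -> dict:
    """Group findings by bucket; return buckets ordered by finding count, highest first."""
    inventory = defaultdict(lambda: {"finding_count": 0, "data_types": set()})
    for finding in findings:
        bucket = (
            finding.get("resourcesAffected", {}).get("s3Bucket", {}).get("name", "unknown")
        )
        entry = inventory[bucket]
        entry["finding_count"] += 1
        # Policy findings carry no classification details, so default to empty.
        sensitive = (
            finding.get("classificationDetails", {}).get("result", {}).get("sensitiveData", [])
        )
        for category in sensitive:
            for detection in category.get("detections", []):
                entry["data_types"].add(detection.get("type", "UNKNOWN"))
    ordered = sorted(inventory.items(), key=lambda item: item[1]["finding_count"], reverse=True)
    return dict(ordered)


def _finding(bucket: str, data_type: str) -> dict:
    """Build a minimal hand-written finding for the example below."""
    return {
        "resourcesAffected": {"s3Bucket": {"name": bucket}},
        "classificationDetails": {
            "result": {"sensitiveData": [{"detections": [{"type": data_type, "count": 1}]}]}
        },
    }


sample_findings = [
    _finding("logs", "EMAIL_PLACEHOLDER"),
    _finding("customers", "IBAN_PLACEHOLDER"),
    _finding("customers", "NATIONAL_ID_PLACEHOLDER"),
]
inventory = build_bucket_inventory(sample_findings)
print(list(inventory))
```

Ordering by finding count is only one possible prioritization. Weighting by data type, for example putting Art.9 categories first, may fit the migration better.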
Phase 2: Deploy Self-Hosted Classification
Stand up your replacement classification stack in parallel with Macie. For a MinIO + Presidio deployment:
Infrastructure (Kubernetes/Docker Compose):
# classification-stack.yaml
services:
  minio:
    image: minio/minio:latest
    command: server /data --console-address ":9001"
    environment:
      MINIO_ROOT_USER: ${MINIO_ROOT_USER}
      MINIO_ROOT_PASSWORD: ${MINIO_ROOT_PASSWORD}
    volumes:
      - minio-data:/data
    ports:
      - "9000:9000"
      - "9001:9001"
  presidio-analyzer:
    image: mcr.microsoft.com/presidio-analyzer:latest
    ports:
      - "5001:3000"
  presidio-anonymizer:
    image: mcr.microsoft.com/presidio-anonymizer:latest
    ports:
      - "5002:3000"
  classification-worker:
    build: ./classification-worker
    environment:
      MINIO_ENDPOINT: minio:9000
      PRESIDIO_ANALYZER_URL: http://presidio-analyzer:3000
      PRESIDIO_ANONYMIZER_URL: http://presidio-anonymizer:3000
    depends_on:
      - minio
      - presidio-analyzer

# Named volumes referenced by a service must be declared at the top level.
volumes:
  minio-data:
Classification Worker (processes S3 events):
from datetime import datetime, timezone

import minio  # MinIO client; not used in this excerpt
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

# The default AnalyzerEngine loads only an English model. Analyzing "de",
# "fr", etc. requires configuring an NLP engine with those models first.
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()


def classify_object(bucket: str, key: str, content: str) -> dict:
    """Classify one object's text content and return a findings record."""
    results = analyzer.analyze(
        text=content,
        entities=["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "IBAN_CODE",
                  "CREDIT_CARD", "NRP", "LOCATION", "MEDICAL_LICENSE"],
        language="en",  # switch to "de", "fr", etc. once those models are configured
    )

    # Most sensitive category wins.
    sensitivity_level = "public"
    if any(r.entity_type in ["MEDICAL_LICENSE", "NRP"] for r in results):
        sensitivity_level = "art9-special-category"
    elif any(r.entity_type in ["CREDIT_CARD", "IBAN_CODE"] for r in results):
        sensitivity_level = "financial"
    elif results:
        sensitivity_level = "personal"

    # Record entity types and counts only, never the matched text itself.
    return {
        "bucket": bucket,
        "key": key,
        "sensitivity_level": sensitivity_level,
        "entity_types": list(set(r.entity_type for r in results)),
        "occurrence_count": len(results),
        "classified_at": datetime.now(timezone.utc).isoformat(),
    }
Phase 3: Validate and Cut Over
Run both Macie and the self-hosted classifier in parallel for two to four weeks, comparing findings for the same S3 objects. Discrepancies reveal gaps in your recognizer coverage. Presidio's built-in recognizers can be supplemented with custom recognizers for EU-specific ID formats not covered by default:
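from presidio_analyzer import PatternRecognizer, Pattern

# German Steuer-ID (Steueridentifikationsnummer): 11 digits, first digit non-zero
german_steuer_id_recognizer = PatternRecognizer(
    supported_entity="DE_STEUER_ID",
    patterns=[
        Pattern(
            "German Steuer-ID",
            r"\b[1-9][0-9]{2}\s?[0-9]{4}\s?[0-9]{4}\b",
            0.85,
        )
    ],
    supported_language="de",
)

# Dutch BSN (Burgerservicenummer): any 9-digit number matches, hence the lower score
dutch_bsn_recognizer = PatternRecognizer(
    supported_entity="NL_BSN",
    patterns=[
        Pattern(
            "Dutch BSN",
            r"\b[0-9]{9}\b",
            0.6,
        )
    ],
    supported_language="nl",
)

# Recognizers only take effect once registered with the analyzer, e.g.:
# analyzer.registry.add_recognizer(german_steuer_id_recognizer)
# analyzer.registry.add_recognizer(dutch_bsn_recognizer)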
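Because a bare nine-digit pattern matches many numbers that are not BSNs, a checksum step helps cut false positives. The sketch below implements the "eleven test" (elfproef) as it is commonly described for BSNs. `is_valid_bsn` is a hypothetical helper, and wiring it into a recognizer (for example by overriding `validate_result` in a `PatternRecognizer` subclass) is left as an assumption to verify against the Presidio version you run.

```python
def is_valid_bsn(candidate: str) -> bool:
    """Check a 9-digit candidate against the BSN eleven test.

    The digits are weighted 9, 8, 7, 6, 5, 4, 3, 2 and -1, and the weighted
    sum must be a non-zero multiple of 11.
    """
    digits = candidate.strip()
    if len(digits) != 9 or not (digits.isascii() and digits.isdigit()):
        return False
    weights = [9, 8, 7, 6, 5, 4, 3, 2, -1]
    total = sum(weight * int(digit) for weight, digit in zip(weights, digits))
    return total != 0 and total % 11 == 0


for value in ["111222333", "123456782", "123456789"]:
    print(value, is_valid_bsn(value))
```

A match that passes the checksum can reasonably be given a higher confidence score than the 0.6 assigned to the raw pattern, while failures can be dropped before they reach the findings store.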
After validation, disable Macie on a per-bucket basis (lowest-risk buckets first), confirming that the self-hosted classifier provides equivalent coverage before proceeding to higher-sensitivity buckets.
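The parallel-run comparison in this phase can be sketched as a per-object diff of detected data types. The example assumes both result sets have already been normalized to a shared set of labels, since Macie and Presidio name their entity types differently. `compare_classifications` and the sample data are hypothetical.

```python
def compare_classifications(macie: dict, self_hosted: dict) -> dict:
    """Report objects where the two classifiers disagree.

    Both inputs map an object key to the set of data types detected in it.
    """
    report = {}
    for key in sorted(set(macie) | set(self_hosted)):
        macie_types = macie.get(key, set())
        local_types = self_hosted.get(key, set())
        if macie_types != local_types:
            report[key] = {
                "only_macie": sorted(macie_types - local_types),
                "only_self_hosted": sorted(local_types - macie_types),
            }
    return report


macie_results = {
    "customers/a.csv": {"iban", "national_id"},
    "customers/b.csv": {"email"},
}
self_hosted_results = {
    "customers/a.csv": {"iban"},
    "customers/b.csv": {"email"},
    "logs/c.json": {"phone"},
}
gaps = compare_classifications(macie_results, self_hosted_results)
print(gaps)
```

Entries under `only_macie` point at recognizers still missing from the self-hosted stack. Entries under `only_self_hosted` are either extra coverage or false positives and deserve a manual look before cutover.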
Cost Comparison
AWS Macie pricing is based on two components: the amount of data evaluated for sensitive data discovery (per GB of S3 objects processed) and the number of S3 buckets monitored. For a data lake with 50TB of S3 data and 200 buckets, Macie costs approximately $1,500-3,000 per month depending on object scanning frequency.
A self-hosted Presidio + MinIO stack on a single EU cloud VM with 16 cores and 64GB RAM can process several terabytes of S3-equivalent data per day. At EU cloud VM pricing of roughly €0.013-0.034 per core-hour (16 cores running about 730 hours a month), equivalent classification capacity costs about €150-400 per month. That is roughly an 80-90% cost reduction compared to the Macie estimate above, without the CLOUD Act exposure.
For organizations with moderate data volumes (under 5TB), the cost difference is smaller, but the sovereignty benefit remains: classification metadata stays in your infrastructure, findings never leave your jurisdiction, and you retain complete control over the classification logic.
What sota.io Provides
sota.io is an EU-native PaaS platform built specifically for developers and organizations that cannot afford CLOUD Act exposure. Your applications, data, and infrastructure run exclusively in EU jurisdiction without US-parent company access.
For data classification workloads, this means you can deploy Presidio, MinIO, Apache Ranger, or OPA on sota.io infrastructure without sending findings, classification results, or sensitive data samples outside EU jurisdiction. You get the Macie-equivalent functionality — automated discovery, classification, and sensitive data inventory — with full data sovereignty.
The key difference from AWS Macie: when the Presidio classifier identifies an Art.9 health record in your storage layer, that finding stays in your infrastructure. There is no US-jurisdiction service that received a sample of the detected content. No CLOUD Act order can reach your classification results because they never touched US infrastructure.
sota.io handles the infrastructure so your team can focus on building the right classification policies for your GDPR obligations, not on managing Kubernetes clusters or configuring distributed storage replication.
Conclusion
AWS Macie solves a real problem: sensitive data accumulates in S3 buckets and needs systematic discovery and classification. The GDPR tension is not that Macie does this badly — it is that Macie's architecture creates new compliance obligations while addressing existing ones.
The six exposure points — Art.9 processing of special category data during classification, Art.17 erasure gaps when findings outlast deleted objects, Art.5(1)(c) reactive-rather-than-preventive minimisation, CLOUD Act access to findings as a sensitive data map, Art.30 circular documentation obligations, and Art.25 privacy-by-design limitations — are structural consequences of running sensitive data classification through US-jurisdiction infrastructure.
EU-native alternatives (Presidio, Apache Ranger, OPA, MinIO ILM, self-hosted Ceph) address these exposure points by keeping classification logic, findings, and metadata in infrastructure where CLOUD Act authority does not apply. The migration is technically straightforward and cost-effective. The compliance benefit is durable: no future change in US-EU data transfer agreements can retroactively expose classification findings that were never processed outside EU jurisdiction.
For organizations building GDPR-compliant data platforms in 2026, the question is not whether to classify sensitive data — Art.30 and Art.25 make this necessary. The question is whether your classification infrastructure creates additional exposure while addressing existing obligations.
sota.io is an EU-native PaaS platform. Deploy Presidio, MinIO, Apache Ranger, or any open-source data classification stack on infrastructure that stays under EU jurisdiction — no US-parent company, no CLOUD Act exposure.