2026-05-05 · 13 min read
AWS Macie is designed to solve a genuine GDPR problem: EU developers struggle to know where personal data lives inside their S3 buckets, object stores, and data lakes. Art.30 requires a record of processing activities. Art.25 requires privacy by design. Art.5(1)(c) requires data minimisation. You cannot achieve any of these without first knowing where your PII is.

Macie promises to solve this with machine learning. It scans your buckets, classifies sensitive data, and generates findings you can map to Art.30 categories. On paper, it is exactly what a GDPR compliance team needs.

The problem is structural: to find PII in your data, Macie must process that PII inside infrastructure operated by AWS, a company that is incorporated in the United States and therefore subject to the CLOUD Act. The tool you use to discover privacy problems creates a new one.

## What AWS Macie Actually Does

Macie is an AWS managed service that uses machine learning and pattern matching to automatically discover, classify, and protect sensitive data stored in Amazon S3. It detects over 100 types of sensitive data including names, addresses, credit card numbers, passport numbers, IBAN codes, health record identifiers, and custom data types you define.

Macie works by ingesting a representative sample of the objects in your monitored S3 buckets. It runs that sample through a classification model hosted in AWS infrastructure. When it finds sensitive data, it generates a "finding" that includes:

- The S3 bucket and object path
- The type of sensitive data detected
- Excerpts showing the sensitive data in context (by default, this is enabled)
- A severity score and count of occurrences

Those findings are stored in your AWS account but processed, transmitted, and temporarily held in US-jurisdiction infrastructure during classification.

## The Five GDPR Exposure Points

### 1. Art.28: Data Processor Without Adequate CLOUD Act Carve-Out

When you enable Macie on an S3 bucket, AWS becomes a data processor under Art.28.
You must have a Data Processing Agreement. AWS provides one. The problem is what the DPA cannot cover.

CLOUD Act Section 2713 (18 U.S.C. § 2713) allows US law enforcement to compel production of data from US-incorporated cloud providers regardless of where that data is physically stored. AWS is incorporated in the United States. A valid CLOUD Act order issued to Amazon would require Amazon to produce data from your Macie scan, including the PII excerpts in findings, even if your S3 buckets sit in eu-west-1.

Art.48 of GDPR states that international data transfers triggered by foreign court orders are only permissible if based on an international agreement between the EU and the requesting country. No such agreement exists with the United States for CLOUD Act orders.

AWS cannot contractually cure this gap. The CLOUD Act override is a US statutory obligation that supersedes private contracts. Your Art.28 DPA with AWS is valid until a CLOUD Act order arrives. At that point, the DPA's protections dissolve.

### 2. Art.35: DPIA Mandatory for Systematic PII Processing at Scale

Art.35(3) makes a DPIA mandatory in specific cases, including "processing on a large scale of special categories of data" (point (b)) and "a systematic monitoring of a publicly accessible area on a large scale" (point (c)). Recital 91 extends the same reasoning to decisions based on profiling and to large-scale processing of sensitive data.

Macie performs systematic, automated, large-scale processing of sensitive data. If you run Macie on production S3 buckets that contain customer data (which is the entire point of using it), you are conducting the kind of systematic sensitive data processing that Art.35 targets.

The DPIA must assess the risks of the processing. Those risks include CLOUD Act compelled disclosure. Your DPIA must document that risk and your mitigation. "We use Macie because AWS says it is GDPR compliant" is not a valid DPIA. The DPIA must show you have assessed the risk and accepted residual risk in writing, with sign-off from your DPO.
Most teams that deploy Macie for GDPR compliance do not have an Art.35 DPIA covering Macie itself. They use a GDPR tool without GDPR-compliant governance of that tool.

### 3. Art.5(1)(b): Purpose Limitation and Model Training Ambiguity

AWS states that customer data is not used to train Macie's underlying ML models by default. This is a policy guarantee, not an architectural one. The data still flows through the model inference pipeline.

The deeper purpose limitation issue is internal: when Macie finds PII in an S3 bucket, it generates findings that contain excerpts of that PII. Those findings are stored in an AWS-managed findings repository. If your compliance team queries those findings, searches for patterns across findings, or exports them to a SIEM, that is a new processing purpose that the original data subjects did not consent to.

Your customer uploaded data for one purpose. Macie creates a parallel universe of findings that contain that customer's PII, stored separately, accessible to different IAM roles, subject to different retention rules, and potentially exported to third-party SIEM tools. Each of these is a new processing purpose requiring a new lawful basis.

### 4. Art.25: Privacy by Design Requires Not Exporting PII Excerpts

Art.25 requires that you implement technical measures that give effect to data minimisation principles. Macie's default configuration includes PII excerpts in findings. This means the sensitive data you are trying to discover and protect is copied into findings objects stored in S3 and viewable in the AWS Console.

A privacy-by-design approach would detect the presence of PII without retaining excerpts. Macie allows you to disable excerpt inclusion, but this is not the default. Most deployments follow the default. You end up with a second copy of PII fragments stored in your findings bucket, typically with less access control than your primary data.
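
What "detecting the presence of PII without retaining excerpts" can look like is sketched below. This is a minimal, standard-library-only illustration, not Macie's or Presidio's API: the `Finding` class, the `scan_text` function, and both regex patterns are hypothetical and intentionally simplistic (no checksum validation, no context words). The point is only that a finding records location, type, and count, and that matched strings are discarded.

```python
import re
from collections import Counter
from dataclasses import dataclass

# Hypothetical, deliberately loose patterns for illustration only.
# A production scanner would use validated recognisers instead.
PATTERNS = {
    "IBAN": re.compile(r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{2,4}){3,8}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


@dataclass(frozen=True)
class Finding:
    """Excerpt-free finding: where, what type, how many. Never the value."""
    location: str
    pii_type: str
    count: int


def scan_text(location: str, text: str) -> list[Finding]:
    """Count matches per PII type without keeping any matched string."""
    counts = Counter()
    for pii_type, pattern in PATTERNS.items():
        # Only the number of matches is stored; the matches are dropped.
        counts[pii_type] = sum(1 for _ in pattern.finditer(text))
    return [Finding(location, t, n) for t, n in sorted(counts.items()) if n > 0]


sample = "Pay DE89 3704 0044 0532 0130 00 and mail anna@example.eu"
findings = scan_text("bucket/invoices/2024/inv-001.txt", sample)
for f in findings:
    print(f)
```

Because `Finding` has no field for the matched text, a findings store built on it cannot become a second copy of the PII. The cost is that triage requires going back to the source object.
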
Disabling excerpts reduces Macie's usefulness for triage (you cannot see what it found without going back to the source). It is a genuine tension between Art.25 minimisation and the operational utility of the tool.

### 5. Art.5(1)(e): Storage Limitation for Findings

Macie findings are retained for 90 days by default. You can export them to S3 for longer-term storage. For Art.5(1)(e) compliance, you need a retention policy for findings that is proportionate to their purpose. If findings contain PII excerpts (the default), those excerpts are subject to the same retention rules as the source data.

Most teams define retention for their primary data stores but forget that findings are a secondary store containing PII. Those findings may outlive the source data they describe, violating Art.5(1)(e) storage limitation for the original data subjects.

## The EU AI Act Dimension: Art.26 Deployer Obligations from August 2026

Macie uses machine learning for classification. Under the EU AI Act, it qualifies as an AI system. Whether it is classified as "high-risk" under Annex III depends on use case. If you use Macie to make decisions about individuals (for example, to determine that a customer's data has been exposed in a misconfigured bucket and to take action on that assessment), it may cross into high-risk territory under Annex III, point 6 (law enforcement adjacent) or Annex III, point 4 (employment and workers management, if applied to employee data).

The deployer obligations in Art.26 (numbered Art.29 in earlier drafts of the Act) apply from August 2026 to high-risk AI systems placed on the EU market or used in the EU. Where your use of Macie falls into that category, the following applies.

Your obligations include:

- Human oversight measures: documenting how humans review and can override Macie's classifications
- Technical documentation: maintaining records of how Macie is configured, what data it processes, and what decisions it influences
- Monitoring: implementing logging of Macie's decisions and their outcomes
- Accuracy and robustness: documenting Macie's false positive/false negative rates for EU-relevant PII types

AWS will provide documentation to help you meet these obligations, but the responsibility for deploying AI systems in compliance with Art.26 sits with you, not with AWS.

## The CLOUD Act Stack: Why EU Jurisdiction Does Not Help

A common response to CLOUD Act concerns is to deploy in AWS eu-west-1 (Ireland) or eu-central-1 (Frankfurt). The belief is that EU-located data is outside US jurisdiction.

This belief is incorrect for any data processed by a US-incorporated entity. AWS is an Amazon.com subsidiary. Amazon.com is incorporated in Delaware. Under CLOUD Act Section 2713, the US government can compel any US-incorporated provider to produce data within its "possession, custody, or control", regardless of where that data is stored.

The phrase "possession, custody, or control" is generally read to include data that AWS can access via its own systems, even when stored in European data centres.

Macie's classification pipeline processes your data. That processing occurs within Amazon's infrastructure. Amazon possesses, has custody of, and controls that data during processing. A valid CLOUD Act order can reach it.

The fix is not choosing a different AWS region. The fix is choosing a provider that is not subject to US law.

## EU-Native PII Discovery Alternatives

### Microsoft Presidio (Open-Source, Self-Hosted)

[Presidio](https://microsoft.github.io/presidio/) is an open-source PII detection and anonymisation framework from Microsoft. It runs locally, so no data leaves your infrastructure.
It supports 50+ PII entity types out of the box, with custom recognisers for domain-specific data.

Presidio uses a combination of rule-based recognition (regex + context), named entity recognition (NER) via spaCy models, and custom ML models. You can run it as a REST API or Python library. It processes text, images, and structured data.

For S3-equivalent EU storage scanning, you combine Presidio with a workflow orchestrator (Apache Airflow, Prefect, or plain Python scripts) that pulls objects from your EU-hosted object store, runs them through Presidio locally, and generates findings in your own format with your own retention policy.

No CLOUD Act exposure. No Art.28 DPA with a US company required. Full control over what findings contain and how long they are retained.

**Self-hosting:** Deploy on sota.io or any EU VPS. Presidio's Docker images are small (~300MB). A 2-core/2GB instance handles moderate scanning workloads.

### OpenDLP (Open-Source DLP Scanner)

OpenDLP is a data loss prevention scanning framework for discovering sensitive data across filesystems, databases, and object stores. It is fully self-hosted and generates findings in your infrastructure.

For EU teams focused on Art.30 compliance, OpenDLP can be integrated with MinIO (self-hosted S3-compatible storage) to scan existing object stores without S3 dependency.

### DataHub (Open-Source Metadata Platform with PII Classification)

[DataHub](https://datahubproject.io/) by LinkedIn is an open-source metadata management platform that includes PII classification as a feature. It builds a data catalogue across your infrastructure, identifies where PII is stored, and allows you to tag and govern that data.

DataHub's PII classification uses metadata-based inference (column names, sample values) rather than exhaustive content scanning, which means lower compute requirements but potentially lower recall for unstructured data.

DataHub is self-hostable, Apache-2.0 licensed, and actively maintained.
No US cloud dependency.

### EU-Hosted Managed Alternatives

Several EU-based cloud providers offer data governance and PII discovery services:

**OVHcloud Data Scanner:** OVHcloud's managed data governance tools include S3-compatible bucket scanning with GDPR-specific classifications. No CLOUD Act risk, because OVHcloud is incorporated in France.

**Scaleway Object Lifecycle Policies + Custom Lambda:** Scaleway (French, GDPR-native) supports lifecycle policies on object storage. Combined with custom serverless functions running Presidio, you can build a Macie-equivalent entirely within French jurisdiction.

**Hetzner + Presidio:** Hetzner (German) provides low-cost compute in Nuremberg and Falkenstein. Running Presidio on a Hetzner instance scanning Hetzner Object Storage is a zero-US-jurisdiction PII discovery stack that costs a fraction of Macie's per-GB pricing.

### sota.io: Self-Hosted PII Discovery Pipeline

If you want the operational simplicity of a managed service with the data sovereignty of self-hosting, sota.io lets you deploy a Presidio-based PII discovery API as a persistent service in EU infrastructure.

A typical architecture:

```
EU Object Store (MinIO / Hetzner / Scaleway)
        ↓
sota.io worker (Presidio REST API, 2-core/2GB)
        ↓
PII findings → your PostgreSQL / Elasticsearch
        ↓
Art.30 ROPA dashboard (your choice of tooling)
```

No data leaves the EU. No US law applies. No CLOUD Act exposure. Your Art.28 DPA is with sota.io (a European company) rather than Amazon. Your Art.35 DPIA documents a closed-loop system where PII never travels to non-EU infrastructure.

## Building an Art.30-Compliant PII Discovery Workflow

A GDPR-compliant PII discovery process has four layers:

**1. Discovery scope definition (before scanning)**

Document which data stores contain personal data, why you believe they contain personal data, and the legal basis for scanning. This becomes the input to your Art.30 record.
Scanning without scope definition creates new Art.30 obligations for the scan itself.

**2. Scan execution (with data minimisation)**

Use a tool that does not retain excerpts. Configure your scanner to output the location and type of PII found, not the PII itself. A finding that says "found 47 instances of IBAN in s3://bucket/invoices/2024/" is sufficient for Art.30 purposes. A finding that includes the actual IBAN numbers is a new data collection that requires its own legal basis.

**3. Findings governance**

Treat findings as personal data. Apply a retention schedule. Restrict access to findings to roles with a documented need. If findings contain excerpts (which they should not), apply the same data subject rights (access, erasure, rectification) to the findings as to the source data.

**4. DPIA documentation**

Your Art.35 DPIA for the scanning process should document:

- The scanning tool and its data flows
- CLOUD Act risk assessment (if using any US-incorporated tool)
- Findings retention and access controls
- Residual risk accepted and sign-off from DPO

## Migrating from Macie: A Practical Checklist

If you are currently using Macie and want to migrate to an EU-native alternative:

- [ ] Audit current Macie configuration: which buckets are monitored, what findings are generated, whether excerpts are enabled
- [ ] Export current findings to your own storage before disabling Macie (findings are deleted when you disable Macie)
- [ ] Set up Presidio or your chosen alternative on EU infrastructure
- [ ] Replicate Macie's custom data identifiers as Presidio custom recognisers
- [ ] Configure your scanning schedule to match Macie's continuous monitoring (or scheduled scans if continuous is cost-prohibitive)
- [ ] Update Art.30 ROPA to reflect the new processor (removing AWS, adding your EU-hosted alternative)
- [ ] Update your Art.35 DPIA
- [ ] Update Art.28 DPA with new processor
- [ ] Notify your DPO

Migration typically takes one to two days of engineering time for a
standard S3 workload. The GDPR paperwork update is the longer task.

## Cost Comparison

Macie pricing is based on data scanned:

- **Automated sensitive data discovery:** $0.10 per GB per month (first 50GB/month included in free tier)
- **Sensitive data discovery jobs:** $1.00 per GB scanned (one-time jobs)

For a 10TB data lake, automated discovery costs approximately $1,000/month (10,000 GB × $0.10).

An equivalent Presidio stack on sota.io:

- sota.io 2-core/2GB instance: ~€10/month
- Object storage (Hetzner, 10TB): ~€50/month
- Engineering time to maintain: 2-4 hours/month

Total: ~€60/month in infrastructure (plus engineering time) versus ~$1,000/month for Macie. That is before the legal risk differential, which is not easily quantified but includes potential fines under GDPR Art.83 of up to €20M or 4% of global annual turnover, whichever is higher.

## The Compliance Paradox

AWS Macie exists to solve a GDPR compliance problem. The practical irony is that deploying Macie without a thorough Art.35 DPIA, a realistic CLOUD Act risk assessment, and appropriate findings governance creates GDPR compliance problems of its own.

Teams that use Macie to demonstrate GDPR compliance to auditors face a difficult question: if the auditor asks about CLOUD Act risk for the Macie findings themselves, what is your answer?

The technically honest answer is that there is no contractual mitigation for CLOUD Act statutory obligations, that your DPIA must document this as an accepted residual risk, and that your DPO must have signed off on that acceptance.

Many teams have not had this conversation. Macie is seen as the GDPR compliance tool, not as a tool that itself requires GDPR compliance governance.

EU-native alternatives sidestep this entirely. There is no CLOUD Act exposure to document. The DPIA covers a simpler risk model. And the cost is lower.

For EU developers who want to run PII discovery without building a GDPR case study about the PII discovery tool itself, self-hosted Presidio on EU infrastructure, or a managed deployment via sota.io, is the architecturally correct choice.
