2026-04-30·12 min read

AWS Athena EU Alternative 2026: Serverless SQL, Data Lake Queries, and the GDPR Query History Problem

Post #723 in the sota.io EU Compliance Series

AWS Athena is Amazon's serverless interactive query service for S3-based data lakes. You define a schema in the AWS Glue Data Catalog, point Athena at an S3 bucket, and run standard SQL without provisioning servers or managing clusters. European engineering and data teams use Athena for ad hoc analytics, data lake exploration, cost-efficient historical analysis, and BI reporting and dashboards via Amazon QuickSight.

Amazon operates Athena in European regions: eu-west-1 (Ireland), eu-central-1 (Frankfurt), eu-west-3 (Paris), eu-north-1 (Stockholm). The data Athena queries resides in your European S3 buckets. Most teams treat this as a GDPR-compliant data analytics layer.

It is not. Amazon Web Services, Inc. is a Delaware corporation headquartered in Seattle, Washington. The CLOUD Act (18 U.S.C. § 2713) compels US companies to produce data stored anywhere in the world when a valid US government order is served. Athena's query engine, query history, workgroup configuration, and query result files — all managed by AWS in your European region — are reachable by a US authority serving a request on Amazon in Seattle.

This analysis applies across the AWS stack: Amazon Redshift, AWS Glue, Amazon Kinesis, Amazon OpenSearch Service, Amazon S3. Athena adds a dimension that is specific to interactive analytics: your SQL queries are personal data records, and your query results are untracked copies of your most sensitive datasets.

What AWS Athena Stores About Your Data Processing

Athena is not a passive query engine. Every interaction with the service — every query submitted, every result written, every workgroup accessed — creates records under AWS management.

Query History: SQL Queries as Personal Data Records

Athena maintains query history for every query executed in a workgroup and retains it for 45 days. Separately, CloudTrail records every StartQueryExecution API call: the default CloudTrail event history covers 90 days, and a trail that delivers to S3 keeps those events for as long as your bucket retention allows, often years.

This matters because SQL queries frequently contain personal data values as literals.

Consider what analysts type into Athena when investigating user issues, debugging data pipelines, or preparing compliance reports:

SELECT * FROM orders WHERE customer_email = 'maria.schneider@example.de';
SELECT * FROM users WHERE user_id IN (12345, 67890, 11111);
SELECT * FROM health_records WHERE patient_national_id = 'DE-12345678';
SELECT * FROM events WHERE ip_address = '85.214.100.1' AND date = '2026-03-15';

Every one of these queries is stored in CloudTrail with the full SQL text. The query contains personal data — an email address, a user ID, a national identifier, an IP address. Under GDPR Art. 4(1), that SQL string, linked to a timestamp and an IAM user who executed it, is itself a personal data record: it documents that a specific AWS user accessed data about a specific individual at a specific time.

If your analysts use Athena for user support lookups, compliance investigations, or ad hoc data exploration, your CloudTrail logs contain a complete audit trail of personal data access — all under US jurisdiction.
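To make the problem concrete, here is a minimal sketch of how a team might scan exported query text for literals that look like personal data. The two regular expressions and the function name find_pii_literals are illustrative assumptions, not a complete PII detector; a real audit would need broader patterns (national IDs, phone numbers, names) and manual review.

```python
import re

# Illustrative patterns only: quoted literals that look like an email or IPv4 address.
PII_PATTERNS = {
    "email": re.compile(r"'[^'\s@]+@[^'\s@]+\.[^'\s@]+'"),
    "ipv4": re.compile(r"'(?:\d{1,3}\.){3}\d{1,3}'"),
}


def find_pii_literals(sql: str) -> dict[str, list[str]]:
    """Return quoted SQL literals that match a pattern, keyed by pattern name."""
    hits = {}
    for name, pattern in PII_PATTERNS.items():
        matches = [m.strip("'") for m in pattern.findall(sql)]
        if matches:
            hits[name] = matches
    return hits


queries = [
    "SELECT * FROM orders WHERE customer_email = 'maria.schneider@example.de';",
    "SELECT * FROM events WHERE ip_address = '85.214.100.1' AND date = '2026-03-15';",
    "SELECT count(*) FROM orders WHERE order_date >= DATE '2026-01-01';",
]

# Keep only the queries that contain at least one suspicious literal.
flagged = [(q, find_pii_literals(q)) for q in queries if find_pii_literals(q)]
for query, hits in flagged:
    print(hits, "<-", query)
```

The first two queries are flagged and the aggregate query is not. Numeric user IDs and national identifiers would slip through these two patterns, which is exactly why literal scanning is only a starting point.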

Query Result Files: Untracked S3 Copies of Personal Data

Every Athena query writes its results to an S3 location called the output location. By default, results land in s3://aws-athena-query-results-{account-id}-{region}/ with a generated UUID path.

This creates a systematic problem for GDPR Art. 5(1)(e) (storage limitation) and Art. 17 (right to erasure):

Scenario: An analyst runs a query retrieving all orders for customers in a specific postal code. The result — a CSV file containing names, email addresses, order history — lands in the Athena output bucket. The analyst downloads the result, completes their analysis, and moves on. The CSV file remains in S3 indefinitely unless lifecycle policies explicitly delete it.

Most teams configure lifecycle policies on their business data buckets. Far fewer configure them on their Athena output buckets, because the output bucket is treated as infrastructure rather than as an application data store. The result: every analytics query over personal data creates an untracked copy with no retention limit.

When a GDPR Art. 17 erasure request arrives, your engineering team deletes the customer record from the primary database and updates the S3 data lake. The copy in the Athena output bucket — containing that customer's data, created by an analyst six months ago for a business report — goes unnoticed. The erasure is incomplete.
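One way to close this gap is an expiration rule on the output bucket. The sketch below only builds the lifecycle configuration document; the function name, the rule ID, and the 7-day multipart cleanup window are assumptions chosen for illustration, and the retention period should come from your own documented policy.

```python
import json


def athena_results_lifecycle(retention_days: int, prefix: str = "") -> dict:
    """Build an S3 lifecycle configuration that expires objects after retention_days."""
    if retention_days < 1:
        raise ValueError("retention_days must be at least 1")
    return {
        "Rules": [
            {
                "ID": f"expire-athena-results-{retention_days}d",
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # empty prefix applies to the whole bucket
                "Expiration": {"Days": retention_days},
                # Assumed 7-day window for cleaning up uploads that never completed.
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    }


config = athena_results_lifecycle(30)
print(json.dumps(config, indent=2))
```

A dictionary of this shape is what boto3's put_bucket_lifecycle_configuration accepts as its LifecycleConfiguration argument. On a versioned bucket an expiration rule alone leaves noncurrent versions behind, so those would need their own rule.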

Workgroups and Named Queries: Saved SQL as Data Access Records

Athena Workgroups allow teams to organize queries by department, project, or data domain. Named Queries let analysts save frequently-used SQL templates. Both create persistent records of your data access patterns under AWS management.

Named Queries that contain personal data values (hardcoded IDs, email domains, specific identifier patterns) are stored in the Athena service indefinitely until explicitly deleted. They represent documented data processing activities — activities that must appear in your Art. 30 Records of Processing if they involve personal data.

Workgroup query history, including saved queries and recent execution history, is accessible via the AWS Athena API — meaning it is subject to the same CLOUD Act production obligations as any other AWS-managed data.

CTAS: Accidental Data Lake Duplication

Athena supports CREATE TABLE AS SELECT (CTAS) — a SQL statement that executes a query and writes the results to a new S3 location as a new Glue Data Catalog table.

CTAS is powerful for creating derived datasets, pre-aggregated reporting tables, and filtered views of large datasets. It is also one of the most common ways to accidentally duplicate personal data in a data lake:

-- This creates a permanent new S3 dataset containing personal data
CREATE TABLE analytics.active_users_q1
WITH (
  format = 'PARQUET',
  external_location = 's3://my-analytics-bucket/active_users_q1/'
)
AS SELECT user_id, email, name, created_at
FROM production.users
WHERE created_at >= '2026-01-01';

The new table is registered in Glue, appears in Athena's schema browser, and persists until someone explicitly drops it. If the analyst who created it leaves the company, or if the project that required it completes, the CTAS table may remain in the data lake indefinitely — a copy of personal data with no defined retention, no documented legal basis in the Art. 30 register, and no connection to any erasure workflow.
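A lightweight guard against orphaned CTAS tables is to require governance metadata on every catalog table and flag the ones that lack it. The sketch below works on dictionaries shaped like the table entries the Glue GetTables API returns (DatabaseName, Name, Parameters, StorageDescriptor). The parameter keys data_owner, retention_days, and legal_basis are a hypothetical tagging convention, not something Glue or Athena defines, and the orders_daily table is invented sample data.

```python
# Hypothetical convention: every table must carry these keys in its Parameters map.
REQUIRED_PARAMETERS = ("data_owner", "retention_days", "legal_basis")


def tables_missing_governance(tables: list[dict]) -> list[dict]:
    """Return one finding per table that lacks any required governance parameter."""
    findings = []
    for table in tables:
        params = table.get("Parameters") or {}
        missing = [key for key in REQUIRED_PARAMETERS if not params.get(key)]
        if missing:
            findings.append(
                {
                    "table": f"{table['DatabaseName']}.{table['Name']}",
                    "location": (table.get("StorageDescriptor") or {}).get("Location"),
                    "missing": missing,
                }
            )
    return findings


# Sample entries; in practice these would come from paging through the Glue catalog.
sample_tables = [
    {
        "DatabaseName": "analytics",
        "Name": "active_users_q1",
        "StorageDescriptor": {"Location": "s3://my-analytics-bucket/active_users_q1/"},
        "Parameters": {"classification": "parquet"},
    },
    {
        "DatabaseName": "analytics",
        "Name": "orders_daily",
        "StorageDescriptor": {"Location": "s3://my-analytics-bucket/orders_daily/"},
        "Parameters": {
            "data_owner": "bi-team",
            "retention_days": "90",
            "legal_basis": "Art. 6(1)(f)",
        },
    },
]

findings = tables_missing_governance(sample_tables)
for finding in findings:
    print(finding)
```

Only the CTAS table from the example above is reported. The check says nothing about whether the recorded legal basis is valid; it only surfaces tables nobody has claimed.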

The Federated Query Jurisdiction Expansion Problem

Athena Federated Query allows Athena to query data sources beyond S3: Amazon RDS, Amazon DynamoDB, Amazon CloudWatch Logs, Amazon Redshift, and any custom data source via a Lambda-based connector.

Federated queries create a data access pattern that is structurally difficult to map in GDPR Art. 30 documentation. A single Athena federated query can:

Pull customer records from RDS (Frankfurt)
Join them with event logs from CloudWatch Logs (Frankfurt)
Enrich with DynamoDB session data (Frankfurt)
Write the combined result to an S3 output location (Frankfurt)

Every component of this query operates in a European AWS region. But every component is orchestrated by the Athena query engine, a US-company-managed service that logs the entire operation in CloudTrail. The federated query result — combining PII from three sources — lands in the Athena output bucket under the same untracked lifecycle problem described above.

The Art. 30 challenge: to document this data flow accurately, you must enumerate not just that you process customer data in Frankfurt, but that Athena federates access across three services simultaneously, creates a combined output, and retains query history under US jurisdiction. Most Art. 30 records describe the primary data store; they do not capture the query engine's access patterns.

The Pay-Per-Scan PII Amplification Problem

Athena charges $5 per TB of data scanned. This pricing model creates a GDPR-relevant behavioral incentive that most teams overlook.

When analysts scan unpartitioned tables, Athena scans the entire dataset. A query that retrieves a single customer's records from an unpartitioned 500 GB table scans all 500 GB — processing the personal data of every customer in the table to return one row. Under GDPR Art. 5(1)(c) (data minimisation), processing far more personal data than necessary to fulfill a request is a compliance issue, not merely an efficiency issue.

Well-designed Athena configurations use partitioned tables and column-level filtering to minimize unnecessary data scans. But the default configuration has no built-in PII awareness — it scans whatever it is pointed at. Without explicit partition design and query governance, Athena systematically over-processes personal data relative to what each query actually requires.
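The arithmetic behind this is simple enough to sketch. The example below assumes decimal terabytes for round numbers and a hypothetical layout in which the 500 GB table is split into 365 evenly sized daily partitions; real partition sizes and the exact billing unit will differ.

```python
PRICE_PER_TB_USD = 5.0   # per-TB price quoted in this article
BYTES_PER_TB = 10**12    # assumption: decimal terabytes, to keep the numbers round


def scan_cost_usd(bytes_scanned: int) -> float:
    """Cost of a single query that scans the given number of bytes."""
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD


table_bytes = 500 * 10**9   # the unpartitioned 500 GB table from the text
daily_partitions = 365      # hypothetical: one year of data, one partition per day

full_scan_bytes = table_bytes
pruned_scan_bytes = table_bytes // daily_partitions  # query touches a single day

full_cost = scan_cost_usd(full_scan_bytes)
pruned_cost = scan_cost_usd(pruned_scan_bytes)
share_processed = pruned_scan_bytes / full_scan_bytes

print(f"full scan:   {full_scan_bytes:>15,} bytes  ${full_cost:.2f}")
print(f"pruned scan: {pruned_scan_bytes:>15,} bytes  ${pruned_cost:.4f}")
print(f"share of the table processed after pruning: {share_processed:.2%}")
```

Under these assumptions the full scan costs $2.50 per lookup, while the partition-pruned query touches well under one percent of the table. The cost difference is small per query; the data minimisation difference is the part that matters for Art. 5(1)(c).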

Lake Formation Column/Row Security: Opt-In, Not Default

AWS Lake Formation provides fine-grained access control for Athena queries: column-level security (prevent analysts from seeing email, phone, SSN columns), row-level filtering (restrict access to records matching specific criteria), and tag-based access control.

Lake Formation's Athena integration can significantly reduce personal data exposure in analytics workflows. It is also entirely opt-in. A default Athena deployment — Glue Data Catalog, S3 data lake, Athena workgroup — has no row or column filtering. Any IAM user with Athena query permissions can read any column in any Glue-catalogued table.

The compliance implication: GDPR Art. 25 (data protection by design and by default) requires that data protection measures are built into systems by default, not added as optional layers. A data analytics platform where analysts can query personal data columns without Lake Formation configuration does not meet the data protection by default standard.
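For teams that stay on Athena, a column-restricted grant is the usual starting point. The sketch below only assembles the request for Lake Formation's GrantPermissions operation, using its TableWithColumns resource with an excluded-column wildcard. The helper name, the role ARN, and the column names are hypothetical; check the request shape against the current API reference before relying on it.

```python
def column_restricted_grant(
    principal_arn: str, database: str, table: str, excluded_columns: list[str]
) -> dict:
    """Build a SELECT grant that covers every column except the excluded ones."""
    if not excluded_columns:
        raise ValueError("list at least one column to exclude")
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnWildcard": {"ExcludedColumnNames": list(excluded_columns)},
            }
        },
        "Permissions": ["SELECT"],
    }


# Hypothetical analyst role and personal data columns.
request = column_restricted_grant(
    principal_arn="arn:aws:iam::123456789012:role/analyst",
    database="production",
    table="users",
    excluded_columns=["email", "name"],
)
print(request)
```

With boto3 installed, a request like this could be passed as keyword arguments to the Lake Formation client's grant_permissions call. A grant of this kind only restricts anything once the table is actually governed by Lake Formation permissions rather than by IAM alone, so it needs to be tested with a real analyst role.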

EU-Native Alternatives to AWS Athena

The following tools provide equivalent serverless SQL analytics capabilities with data residency on EU-sovereign infrastructure — no US parent company, no CLOUD Act exposure.

Trino (Formerly PrestoSQL) — EU Self-Hosted

Trino is the open-source distributed SQL query engine that powers major data lake platforms including Amazon Athena itself. AWS built Athena on Presto; Trino is the independent successor maintained by the Trino Software Foundation.

Self-hosted Trino deployed on European infrastructure (Hetzner, OVH Cloud, Scaleway, Deutsche Telekom Open Telekom Cloud) provides identical SQL analytics capabilities with full data sovereignty:

Queries S3-compatible storage (MinIO self-hosted, or Scaleway Object Storage)
Connectors for PostgreSQL, MySQL, Cassandra, Kafka, MongoDB, Elasticsearch
Federated queries across multiple data sources
Role-based access control built into the Trino coordinator
Hive metastore or Glue-compatible schema management (via Apache Hive or Nessie catalog)

With Trino on EU infrastructure, query history logs are stored on your servers. Result files land in your S3-compatible storage with lifecycle policies you control. CTAS creates tables in your catalog under governance you define. No query data leaves EU jurisdiction.

Deploy Trino on sota.io with full EU data sovereignty — Docker Compose or Kubernetes deployment in under 30 minutes.

DuckDB — In-Process EU Serverless Analytics

DuckDB is an in-process OLAP database that runs inside your application process, Jupyter notebook, or CLI with no server infrastructure required. For analytics workflows that don't require distributed processing, DuckDB eliminates the managed service entirely.

DuckDB runs on any EU server and queries:

Local Parquet, CSV, JSON files
S3-compatible storage (MinIO, Scaleway, OVH) via the S3 extension
Delta Lake tables
Apache Iceberg tables
PostgreSQL (via the Postgres scanner extension)

For teams using Athena primarily for ad hoc data exploration, BI reporting, or ETL validation, DuckDB running on an EU VM is a complete replacement. Query history is local to the machine. Results stay in your storage. There is no managed service — and therefore no CLOUD Act surface.

ClickHouse — EU-Deployable Columnar Database

ClickHouse is an open-source columnar database optimized for analytical queries. While not a serverless SQL-over-S3 engine like Athena, ClickHouse replaces Athena for teams whose primary use case is time-series analytics, event data analysis, or high-throughput aggregation queries.

ClickHouse Community Edition self-hosted on EU infrastructure provides:

Sub-second aggregation queries over billions of rows
Native S3-backed storage (MergeTree tables on an S3 disk; SharedMergeTree is a ClickHouse Cloud feature and is not part of the self-hosted edition)
Role-based access control and column-level encryption
Materialized views as real-time pre-aggregation
GDPR erasure via ALTER TABLE ... DELETE WHERE user_id = ?

ClickHouse on Hetzner or OVH provides a cost-effective alternative to Athena for teams processing large volumes of event or log data.

Apache Spark on EU Kubernetes

Apache Spark provides distributed data processing with full SQL analytics via Spark SQL. Deployed on a self-managed Kubernetes cluster in an EU data center, Spark provides:

Full SQL analytics over Parquet, Delta Lake, Apache Iceberg
PySpark and Scala APIs for complex transformations
Spark Structured Streaming for real-time analytics
Integration with Hive metastore or Apache Nessie for schema management

For teams with existing Spark expertise, migrating from Athena to self-hosted Spark eliminates the managed query service while retaining familiar SQL semantics.

Comparison: AWS Athena vs EU-Native Alternatives

Feature	AWS Athena	Trino (EU)	DuckDB (EU)	ClickHouse (EU)
Data jurisdiction	US CLOUD Act	EU sovereign	EU sovereign	EU sovereign
Query history storage	CloudTrail (AWS)	Local logs	Local logs	Local logs
Result file lifecycle	Manual S3 policy	Your storage policy	Local file	Table-managed
Setup complexity	Zero (managed)	Medium	Very low	Low–medium
Federated queries	Yes (Lambda)	Yes (connectors)	Yes (extensions)	Partial
Column security	Lake Formation (opt-in)	Built-in RBAC	None built in (OS and storage permissions)	Built-in RBAC
Scale	Petabytes (managed)	Petabytes (clustered)	GBs–TBs (single node)	TBs–PBs
Cost model	$5/TB scanned	Infrastructure cost	Infrastructure cost	Infrastructure cost
GDPR compliance	Structural gap	Full control	Full control	Full control

If you are currently running AWS Athena and have not audited your GDPR exposure, several actions are required:

1. Audit query history in CloudTrail: Query your CloudTrail logs (for example with CloudTrail Lake, or with a table defined over the trail's S3 bucket) to identify Athena StartQueryExecution events over the past 90 days. Export the query text and check for SQL literals that contain personal data values (email addresses, user IDs, national identifiers). Document findings in your Art. 30 register.

2. Configure lifecycle policies on the Athena output bucket: Every S3 bucket used as an Athena output location requires a lifecycle policy aligned to your documented retention periods. A 30-day lifecycle policy on the output bucket prevents indefinite accumulation of query result copies. Apply this immediately.

3. Audit CTAS tables in Glue Data Catalog: List all tables in your Glue catalog created via CTAS. For each table, verify that a legal basis exists for retaining the data, that the retention period is defined, and that a data owner is documented in your Art. 30 records. Delete CTAS tables that have no documented legal basis.

4. Assess Lake Formation implementation: If your Athena deployment processes personal data categories (customer PII, behavioral analytics, health data), evaluate whether Lake Formation column-level security is configured. An absence of column-level security on personal data columns is an Art. 25 compliance gap.

5. Address the CLOUD Act exposure: If your data lake contains special category data under Art. 9 (health records, political opinions, biometric data) or other highly sensitive data such as financial records, and Athena is your query layer, the structural CLOUD Act exposure is not solvable through configuration. Only moving the query layer to EU-sovereign infrastructure (Trino, DuckDB, or ClickHouse on EU VMs) resolves it.
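Step 1 above can be sketched as a small filter over exported CloudTrail records. The field names (eventSource, eventName, requestParameters, userIdentity, eventTime) follow the CloudTrail record format; the sample records and the function name are invented for illustration. The sketch assumes the queryString field carries the SQL text in your logs; where it is absent or masked, it reports None and the Athena query history would be the place to look instead.

```python
from collections import defaultdict


def athena_queries_by_principal(records: list[dict]) -> dict[str, list[dict]]:
    """Group Athena StartQueryExecution events by the IAM principal that ran them."""
    grouped = defaultdict(list)
    for record in records:
        if record.get("eventSource") != "athena.amazonaws.com":
            continue
        if record.get("eventName") != "StartQueryExecution":
            continue
        params = record.get("requestParameters") or {}
        identity = record.get("userIdentity") or {}
        grouped[identity.get("arn", "unknown")].append(
            {"time": record.get("eventTime"), "query": params.get("queryString")}
        )
    return dict(grouped)


# Invented sample records in the CloudTrail shape.
sample_records = [
    {
        "eventSource": "athena.amazonaws.com",
        "eventName": "StartQueryExecution",
        "eventTime": "2026-03-15T09:12:00Z",
        "userIdentity": {"arn": "arn:aws:iam::123456789012:user/analyst-a"},
        "requestParameters": {
            "queryString": "SELECT * FROM users WHERE user_id IN (12345, 67890, 11111);"
        },
    },
    {
        "eventSource": "athena.amazonaws.com",
        "eventName": "GetQueryResults",
        "eventTime": "2026-03-15T09:12:05Z",
        "userIdentity": {"arn": "arn:aws:iam::123456789012:user/analyst-a"},
        "requestParameters": {},
    },
    {
        "eventSource": "s3.amazonaws.com",
        "eventName": "GetObject",
        "eventTime": "2026-03-15T09:13:00Z",
        "userIdentity": {"arn": "arn:aws:iam::123456789012:user/analyst-b"},
        "requestParameters": None,
    },
]

report = athena_queries_by_principal(sample_records)
for principal, events in report.items():
    print(principal, len(events))
```

The per-principal grouping gives a first draft of who accessed what and when, which is the raw material for the Art. 30 findings in step 1.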

Conclusion

AWS Athena is the most deceptively GDPR-complex service in the analytics stack. It appears to be stateless — you query S3, get results, pay per TB. In reality, Athena creates personal data records in three places: CloudTrail query history containing SQL literals, untracked result files in S3 output locations, and CTAS-derived tables with no defined retention. Each of these creates compliance gaps that configuration alone cannot fully close while Athena remains a US-jurisdiction managed service.

For new data lake projects in 2026, the EU-sovereign path is clear:

Ad hoc SQL analytics: Trino on Hetzner/OVH querying MinIO or Scaleway Object Storage — identical SQL semantics, full query history control
Single-node analytics and data exploration: DuckDB on any EU VM — zero managed service, runs anywhere, full SQL support
High-throughput event analytics: ClickHouse CE on EU infrastructure — sub-second aggregations, built-in access control

For teams migrating from Athena, the SQL migration path is straightforward: Trino is syntax-compatible with Athena (both use Presto SQL dialect), making query migrations largely mechanical. The real migration effort is rearchitecting the output location lifecycle management — replacing the ad hoc Athena output bucket with governed storage in your data lake with defined retention policies.

Part of the sota.io EU Compliance Series — covering every AWS service through the lens of GDPR and the CLOUD Act.
