AWS Redshift EU Alternative 2026: Data Warehouse, GDPR Compliance, and CLOUD Act Risk
Post #721 in the sota.io EU Compliance Series
AWS Redshift is Amazon's managed cloud data warehouse. Organizations use it to store and analyze years of transactional history, behavioral data, product analytics, and customer records at petabyte scale. Redshift clusters hold the most complete picture of user behavior available in any organization's infrastructure: every purchase, every session, every support interaction, aggregated into a queryable historical record.
That record exists in AWS infrastructure subject to US CLOUD Act jurisdiction. A valid US government order served on Amazon can compel disclosure of a Redshift cluster's contents — years of customer transaction history, behavioral analytics, and user profiles — without requiring the data controller to be notified in advance.
For organizations subject to GDPR, a data warehouse presents a specific compliance challenge that differs from transactional databases or streaming infrastructure. A data warehouse is designed for retention: data flows in from operational systems, is transformed and aggregated, and is kept for months or years to support historical analysis. The purpose of a data warehouse is to accumulate personal data over time. That accumulation creates GDPR obligations around storage limitation, purpose limitation, and the right to erasure that are significantly harder to fulfill when the storage layer is a US-jurisdiction service.
What AWS Redshift Stores
Redshift's role in the data architecture determines what personal data it holds and under what GDPR obligations.
Transactional History and Customer Records
The most common Redshift use case for consumer-facing organizations is an analytics layer over operational databases. ETL pipelines (often AWS Glue) extract records from RDS or DynamoDB and load them into Redshift fact and dimension tables.
A typical customer analytics schema:
-- Dimension table: customer records
CREATE TABLE dim_customers (
    customer_id BIGINT SORTKEY,
    email VARCHAR(255),
    first_name VARCHAR(100),
    last_name VARCHAR(100),
    date_of_birth DATE,
    phone VARCHAR(20),
    country CHAR(2),
    city VARCHAR(100),
    postal_code VARCHAR(20),
    created_at TIMESTAMP,
    last_login TIMESTAMP,
    account_status VARCHAR(20),
    consent_email BOOLEAN,
    consent_tracking BOOLEAN
) DISTSTYLE KEY DISTKEY(customer_id);

-- Fact table: order history
CREATE TABLE fact_orders (
    order_id BIGINT SORTKEY,
    customer_id BIGINT,
    order_date TIMESTAMP,
    total_eur DECIMAL(10,2),
    payment_method VARCHAR(50),
    shipping_street VARCHAR(255),
    shipping_city VARCHAR(100),
    shipping_postal VARCHAR(20),
    shipping_country CHAR(2),
    items_count INTEGER,
    status VARCHAR(20)
) DISTSTYLE KEY DISTKEY(customer_id);

-- Fact table: behavioral events
CREATE TABLE fact_events (
    event_id BIGINT,
    customer_id BIGINT,
    session_id VARCHAR(64),
    event_type VARCHAR(50),
    page_url VARCHAR(500),
    ip_address VARCHAR(45),
    user_agent TEXT,
    device_type VARCHAR(20),
    referrer VARCHAR(500),
    event_timestamp TIMESTAMP SORTKEY
) DISTSTYLE KEY DISTKEY(customer_id);
A Redshift schema like this holds: full customer identity (name, email, address, phone, date of birth), complete order history with shipping addresses and payment method references, and full behavioral event logs including IP addresses, user agents, and browsing patterns. Every field individually qualifies as personal data under GDPR Article 4.
In a typical e-commerce deployment, a Redshift cluster retains this data for three to five years for business analytics, trend analysis, and machine learning training. Under GDPR Article 5(1)(e), personal data must be kept for no longer than necessary for the purpose for which it was collected. Retaining customer behavioral data for five years in a data warehouse therefore requires a documented legal basis and a storage limitation policy, which many Redshift deployments lack.
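A storage limitation policy of this kind can be made executable rather than aspirational. As a minimal sketch, retention rules can live in one policy structure from which cleanup statements are generated. The table names follow the example schema above; the retention periods and legal-basis labels are illustrative assumptions, not recommendations:

```python
from datetime import date, timedelta

# Hypothetical retention policy: table -> (timestamp column, retention in days,
# documented legal basis). Periods and basis strings are illustrative.
RETENTION_POLICY = {
    "fact_events": ("event_timestamp", 730, "legitimate interest, product analytics"),
    "fact_orders": ("order_date", 1825, "legal obligation, financial record-keeping"),
}

def retention_delete_statements(policy, today=None):
    """Generate one DELETE per table, bounded by its documented retention period."""
    today = today or date.today()
    statements = []
    for table, (ts_column, days, _basis) in sorted(policy.items()):
        cutoff = today - timedelta(days=days)
        statements.append(
            f"DELETE FROM {table} WHERE {ts_column} < '{cutoff.isoformat()}';"
        )
    return statements

for stmt in retention_delete_statements(RETENTION_POLICY, today=date(2026, 1, 1)):
    print(stmt)
```

Keeping the policy as data means each retention period carries its legal basis next to the deletion it justifies, which also feeds directly into Article 30 documentation.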
Redshift Spectrum and the Federated Data Problem
Redshift Spectrum extends Redshift queries to data stored in S3, allowing analysts to query raw log files and data lake content without loading it into Redshift tables.
-- Spectrum query: joining Redshift tables with S3 data lake
-- This query touches personal data in multiple jurisdictions
CREATE EXTERNAL TABLE spectrum_raw_logs (
    timestamp TIMESTAMP,
    user_id BIGINT,
    ip_address VARCHAR(45),
    request_path VARCHAR(500),
    response_code INTEGER,
    bytes_sent INTEGER,
    user_agent TEXT,
    referer VARCHAR(500)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://access-logs-prod/2026/';

-- Analyst query: session analysis combining warehouse + raw logs
SELECT
    c.email,
    c.country,
    COUNT(DISTINCT l.ip_address) AS ip_count,
    COUNT(*) AS request_count,
    MIN(l.timestamp) AS first_seen,
    MAX(l.timestamp) AS last_seen
FROM spectrum_raw_logs l
JOIN dim_customers c ON c.customer_id = l.user_id
WHERE l.timestamp >= DATEADD(day, -90, GETDATE())
GROUP BY c.email, c.country
ORDER BY request_count DESC;
Spectrum queries execute in Redshift's compute layer but read data from S3. For GDPR purposes, this means a single Spectrum query can touch personal data in both Redshift (the processed, structured warehouse) and S3 (raw log files or data lake content) — both under US CLOUD Act jurisdiction — in a single execution.
Spectrum also enables cross-account data sharing: an organization can grant Redshift Spectrum access to S3 buckets in other AWS accounts. In multi-entity corporate structures, this creates processing relationships that must be covered by data processing agreements under GDPR Article 28, since the Spectrum query engine acts as a processor handling personal data from the other account's S3 buckets.
Redshift ML and the SageMaker Layer
Redshift ML integrates Amazon SageMaker to train machine learning models directly on Redshift data:
-- Train a customer churn prediction model on Redshift data
CREATE MODEL customer_churn_model
FROM (
    SELECT
        c.customer_id,
        c.country,
        c.date_of_birth,
        DATEDIFF(day, c.created_at, GETDATE()) AS account_age_days,
        DATEDIFF(day, c.last_login, GETDATE()) AS days_since_login,
        COUNT(o.order_id) AS total_orders,
        SUM(o.total_eur) AS total_spent,
        MAX(o.order_date) AS last_order_date,
        COUNT(e.event_id) AS total_events,
        CASE WHEN c.account_status = 'churned' THEN 1 ELSE 0 END AS churned
    FROM dim_customers c
    LEFT JOIN fact_orders o ON c.customer_id = o.customer_id
    LEFT JOIN fact_events e ON c.customer_id = e.customer_id
    GROUP BY 1, 2, 3, 4, 5, 10
)
TARGET churned
FUNCTION predict_churn
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-ml-role'
SETTINGS (S3_BUCKET 'ml-model-artifacts-prod');
This SQL statement creates a machine learning model trained on personal data — customer age, purchase history, behavioral event counts, account status. The training process extracts data from Redshift, transfers it to AWS SageMaker for model training, and stores the resulting model artifacts in S3.
Under GDPR, using personal data to train a machine learning model is a processing activity that requires a legal basis and must be documented in Article 30 records. The model itself — encoding statistical patterns derived from individual customer behavior — may constitute personal data if it can be used to make decisions about or identify individuals. Redshift ML adds SageMaker as a second US-jurisdiction service processing the same personal data as Redshift.
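What an Article 30 record for this training activity might capture can be sketched as a simple structured record. The field names, controller name, and retention value below are illustrative assumptions, not a legal template:

```python
# Illustrative Article 30 record for the churn-model training activity.
# This sketches the categories GDPR Art. 30(1) asks for; it is not legal advice.
article_30_record = {
    "activity": "customer churn model training",
    "controller": "example-org GmbH",  # hypothetical controller
    "purposes": ["churn prediction for retention campaigns"],
    "categories_of_data_subjects": ["customers"],
    "categories_of_personal_data": [
        "date_of_birth", "country", "order history aggregates",
        "behavioral event counts", "account status",
    ],
    "recipients": ["AWS Redshift", "AWS SageMaker", "AWS S3"],
    "third_country_transfers": "US jurisdiction exposure via CLOUD Act",
    "retention": "model artifacts retained 12 months",  # assumed policy
    "security_measures": ["encryption at rest", "IAM role scoping"],
}

def record_is_complete(record):
    """Check that the record covers the minimum fields tracked here."""
    required = {"activity", "controller", "purposes",
                "categories_of_personal_data", "recipients", "retention"}
    return required <= set(record)

print(record_is_complete(article_30_record))
```

Keeping such records as structured data rather than prose makes it possible to validate them in CI whenever a pipeline that touches personal data changes.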
Redshift Serverless and the Jurisdiction Gap
Redshift Serverless removes the cluster management requirement, providing on-demand data warehouse capacity without pre-provisioned nodes.
import boto3

# Redshift Serverless query execution via the Data API
client = boto3.client('redshift-data', region_name='eu-central-1')

# Execute query against Serverless workgroup
response = client.execute_statement(
    WorkgroupName='analytics-workgroup',
    Database='production',
    Sql="""
        SELECT
            customer_id,
            email,
            SUM(total_eur) AS lifetime_value,
            COUNT(order_id) AS order_count,
            MAX(order_date) AS last_purchase
        FROM fact_orders
        JOIN dim_customers USING (customer_id)
        WHERE country = 'DE'
        GROUP BY 1, 2
        HAVING SUM(total_eur) > 500
        ORDER BY lifetime_value DESC
    """
)
Redshift Serverless abstracts the underlying infrastructure, making it even less transparent than standard Redshift clusters where the provisioned node count and instance types are visible. From a GDPR audit perspective, an organization using Redshift Serverless cannot verify the specific infrastructure their personal data runs on, the number of nodes that processed a query, or the geographic distribution of query execution — the Serverless layer manages all of this opaquely.
The IAM policy required for Redshift Serverless access — redshift-serverless:ExecuteStatement — grants the ability to run arbitrary SQL against all data in the workgroup. In organizations where IAM policies are broadly assigned to analytics users, this creates an access control surface that is difficult to audit for GDPR data minimization compliance.
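Auditing that access surface can start from the policy documents themselves. A minimal sketch that flags Allow statements granting `redshift-serverless:ExecuteStatement` on a wildcard resource; the policy JSON is a hypothetical example:

```python
import json

# Hypothetical IAM policy document: one overly broad grant, one scoped to a workgroup.
policy_document = json.loads("""
{
  "Version": "2012-10-17",
  "Statement": [
    {"Effect": "Allow",
     "Action": ["redshift-serverless:ExecuteStatement"],
     "Resource": "*"},
    {"Effect": "Allow",
     "Action": ["redshift-serverless:ExecuteStatement"],
     "Resource": "arn:aws:redshift-serverless:eu-central-1:123456789012:workgroup/analytics-workgroup"}
  ]
}
""")

def overly_broad_statements(policy, action="redshift-serverless:ExecuteStatement"):
    """Return Allow statements that grant `action` on a wildcard resource."""
    findings = []
    for stmt in policy.get("Statement", []):
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        actions = [actions] if isinstance(actions, str) else actions
        resources = stmt.get("Resource", [])
        resources = [resources] if isinstance(resources, str) else resources
        if action in actions and "*" in resources:
            findings.append(stmt)
    return findings

print(len(overly_broad_statements(policy_document)))
```

In practice the input would come from an IAM policy export; the check itself stays the same, and wildcard matches can be reported as data minimization findings.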
The Data Warehouse GDPR Risk Profile
Data warehouses create GDPR risks that differ structurally from operational databases or streaming systems.
Storage Limitation at Petabyte Scale
GDPR Article 5(1)(e) requires personal data to be stored for no longer than necessary. In operational databases, records are typically deleted when they are no longer needed for the transaction (a customer record is retained while the customer account is active). Data warehouses invert this: data is never deleted from the warehouse because historical completeness is the point.
A Redshift cluster that has been running for three years may contain:
- Customer records for users who have deleted their accounts (right to erasure requests were fulfilled in the operational database but the warehouse was never updated)
- IP address logs from before consent management was implemented (no legal basis for processing)
- Behavioral data from users who have withdrawn consent for behavioral tracking
- Transaction history for jurisdictions where the limitation period for financial data has passed
The data warehouse is the graveyard of personal data that should have been deleted from operational systems but was never purged from analytics infrastructure. Under GDPR, the right to erasure (Article 17) applies to all copies of personal data, including copies in data warehouses and data lakes. Organizations that have fulfilled erasure requests in production databases but not in Redshift are non-compliant with every such request.
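One way to surface this gap is to reconcile fulfilled erasure requests against what the warehouse still holds. A minimal sketch with in-memory ID sets standing in for the two systems; all IDs and table names are illustrative:

```python
# IDs erased in the operational database (e.g. from an erasure-request log),
# versus customer IDs still present in warehouse tables. Values are illustrative.
erased_in_operational_db = {7842913, 1102234, 5567001}
still_in_warehouse = {
    "dim_customers": {7842913, 9900001},
    "fact_orders": {7842913, 1102234, 3344556},
    "fact_events": {1102234},
}

def erasure_backlog(erased_ids, warehouse_tables):
    """Map each warehouse table to erased customer IDs that still exist in it."""
    return {
        table: sorted(ids & erased_ids)
        for table, ids in warehouse_tables.items()
        if ids & erased_ids
    }

backlog = erasure_backlog(erased_in_operational_db, still_in_warehouse)
```

Run periodically, with the sets fed from `SELECT DISTINCT customer_id` queries, a non-empty backlog is direct evidence of unfulfilled Article 17 obligations.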
Purpose Limitation and Analytics Scope Creep
GDPR Article 5(1)(b) requires personal data to be collected for specified, explicit, and legitimate purposes. Data warehouses enable any analyst to run any query against all available data — a structural challenge for purpose limitation.
-- Query that likely exceeds the original purpose of data collection
-- Marketing analyst exploring health-correlated purchase patterns
-- (assumes an item-level product_id and a dim_products table not shown in the schema above)
SELECT
    FLOOR(DATEDIFF(year, date_of_birth, GETDATE()) / 10) * 10 AS age_decade,
    product_category,
    AVG(total_eur) AS avg_spend,
    COUNT(*) AS purchase_count
FROM fact_orders
JOIN dim_customers USING (customer_id)
JOIN dim_products USING (product_id)
WHERE product_category IN ('vitamins', 'supplements', 'medical-devices', 'home-health')
GROUP BY 1, 2
ORDER BY avg_spend DESC;
A marketing analyst running this query is correlating customer age with purchases in health-adjacent product categories. Whether such purchases constitute health data under GDPR Article 9 is a legal question. The structural point stands either way: a data warehouse where analysts can freely join any dimension against any fact table makes purpose limitation effectively unenforceable.
The GDPR compliance documentation for a Redshift deployment often amounts to a vague statement that data is used "for analytics and business improvement." A purpose description that covers every query an analyst could run effectively covers no specific purpose at all.
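A purpose description only constrains anything if it is checked per query. One possible sketch: register each table against the purposes it may serve, and compare a query's table set with the analyst's declared purpose. The purpose taxonomy and table bindings below are illustrative assumptions:

```python
# Hypothetical purpose registry: each table bound to the purposes it may serve.
# Table names follow the example schema; the purpose taxonomy is illustrative.
PURPOSE_REGISTRY = {
    "fact_orders": {"financial_reporting", "order_fulfillment_analytics"},
    "dim_customers": {"order_fulfillment_analytics", "account_management"},
    "fact_events": {"product_analytics"},
}

def query_allowed(tables_touched, declared_purpose, registry):
    """Allow a query only if every table it touches is registered for the purpose."""
    return all(declared_purpose in registry.get(t, set()) for t in tables_touched)

# A marketing exploration over orders + customers fails; financial reporting passes.
print(query_allowed({"fact_orders", "dim_customers"}, "marketing_segmentation", PURPOSE_REGISTRY))
print(query_allowed({"fact_orders"}, "financial_reporting", PURPOSE_REGISTRY))
```

Wired into a query gateway or review step, this turns "for analytics" into an auditable per-table decision rather than a blanket statement.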
The Right to Erasure at Warehouse Scale
Implementing the right to erasure in a columnar data warehouse like Redshift is operationally expensive:
-- Right to erasure implementation for a single customer in Redshift
-- Must be run across all tables containing personal data
BEGIN;

-- Delete from fact tables (expensive full scans without customer_id sort key)
DELETE FROM fact_events WHERE customer_id = 7842913;
DELETE FROM fact_orders WHERE customer_id = 7842913;
DELETE FROM fact_page_views WHERE customer_id = 7842913;
DELETE FROM fact_support_tickets WHERE customer_id = 7842913;
DELETE FROM fact_email_interactions WHERE customer_id = 7842913;

-- Update dimension table (cannot delete without breaking referential integrity)
UPDATE dim_customers
SET
    email = 'deleted@erasure.invalid',
    first_name = 'DELETED',
    last_name = 'DELETED',
    phone = NULL,
    date_of_birth = NULL,
    postal_code = NULL,
    city = NULL
WHERE customer_id = 7842913;

COMMIT;

-- Reclaim disk space (required to physically remove data from storage)
VACUUM DELETE ONLY fact_events;
VACUUM DELETE ONLY fact_orders;
-- ... one VACUUM per table
Redshift's columnar storage means DELETE operations mark rows as deleted but do not immediately reclaim disk space. The VACUUM command reclaims space but is computationally expensive and cannot run concurrently with large query loads. In a large Redshift cluster under continuous analytical query load, running VACUUM DELETE ONLY across a 10TB table to fulfill a single erasure request is operationally impractical.
Organizations that have deployed Redshift without designing erasure support into their ETL pipelines and table schemas frequently discover this problem when processing GDPR erasure requests: they can mark records as deleted in the production database, but physically removing the corresponding rows from Redshift at scale requires a pipeline that was never built.
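Where erasure support must be retrofitted, one common mitigation is batching: collect pending erasure requests and run one DELETE per table for the whole batch, so VACUUM runs once per batch rather than once per request (Article 17 still requires erasure "without undue delay", which bounds how long a batch can wait). A minimal sketch that generates such a batch; table names follow the example schema:

```python
def batched_erasure_sql(customer_ids, fact_tables, vacuum=True):
    """Generate one DELETE per fact table covering a whole batch of erasure
    requests, plus an optional VACUUM per table, instead of per-customer runs."""
    ids = ", ".join(str(i) for i in sorted(customer_ids))
    statements = [
        f"DELETE FROM {table} WHERE customer_id IN ({ids});" for table in fact_tables
    ]
    if vacuum:
        statements += [f"VACUUM DELETE ONLY {table};" for table in fact_tables]
    return statements

sql = batched_erasure_sql({7842913, 1102234}, ["fact_events", "fact_orders"])
for stmt in sql:
    print(stmt)
```

Amortizing the VACUUM cost this way does not remove Redshift's structural overhead, but it keeps the per-request cost roughly constant as request volume grows.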
EU-Sovereign Alternatives for Data Warehousing
ClickHouse (Self-Hosted, EU-Sovereign)
ClickHouse is one of the highest-performance open-source column-oriented databases for analytical queries. It can match or exceed Redshift query performance on many analytical workloads, and it can be self-hosted on EU-sovereign infrastructure.
-- ClickHouse equivalent of the Redshift schema
CREATE TABLE dim_customers (
    customer_id UInt64,
    email String,
    first_name String,
    last_name String,
    date_of_birth Date,
    phone Nullable(String),
    country FixedString(2),
    city String,
    postal_code String,
    created_at DateTime,
    last_login DateTime,
    account_status LowCardinality(String),
    consent_email UInt8,
    consent_tracking UInt8
) ENGINE = ReplacingMergeTree()
ORDER BY customer_id
SETTINGS storage_policy = 'eu_sovereign_storage';

CREATE TABLE fact_orders (
    order_id UInt64,
    customer_id UInt64,
    order_date DateTime,
    total_eur Decimal(10,2),
    payment_method LowCardinality(String),
    shipping_city String,
    shipping_country FixedString(2),
    items_count UInt16,
    status LowCardinality(String)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(order_date)
ORDER BY (customer_id, order_date)
TTL order_date + INTERVAL 3 YEAR;
The TTL order_date + INTERVAL 3 YEAR clause implements GDPR storage limitation directly in the table definition: rows older than three years are automatically deleted. ClickHouse's TTL mechanism enforces storage limitation at the storage engine level without requiring manual cleanup processes — a structural advantage over Redshift for GDPR compliance.
ClickHouse deletes rows through table mutations and needs no separate VACUUM step to reclaim space:
-- GDPR erasure in ClickHouse: immediate and complete
ALTER TABLE dim_customers DELETE WHERE customer_id = 7842913;
ALTER TABLE fact_orders DELETE WHERE customer_id = 7842913;
ALTER TABLE fact_events DELETE WHERE customer_id = 7842913;
ClickHouse mutations execute asynchronously, but on warehouse-scale datasets they typically complete within minutes and carry none of the VACUUM overhead that makes Redshift erasure operationally impractical.
Deployed on EU-sovereign infrastructure via sota.io, ClickHouse provides a complete Redshift equivalent with all data remaining in EU jurisdiction under a single GDPR DPA.
Apache Doris (Self-Hosted, Real-Time Analytics)
Apache Doris is an open-source OLAP database with MySQL protocol compatibility, making it simpler to integrate with existing BI tooling than ClickHouse.
-- Apache Doris table with GDPR-aligned partitioning
CREATE TABLE fact_events (
    event_date DATE NOT NULL,
    event_id BIGINT NOT NULL,
    customer_id BIGINT,
    session_id VARCHAR(64),
    event_type VARCHAR(50),
    page_url VARCHAR(500),
    ip_address VARCHAR(45),
    event_timestamp DATETIME
)
DUPLICATE KEY(event_date, event_id)
PARTITION BY RANGE(event_date)(
    PARTITION p2024 VALUES LESS THAN ('2025-01-01'),
    PARTITION p2025 VALUES LESS THAN ('2026-01-01'),
    PARTITION p2026 VALUES LESS THAN ('2027-01-01')
)
DISTRIBUTED BY HASH(customer_id) BUCKETS 32
PROPERTIES (
    "dynamic_partition.enable" = "true",
    "dynamic_partition.time_unit" = "YEAR",
    "dynamic_partition.start" = "-3",
    "dynamic_partition.end" = "1",
    "dynamic_partition.prefix" = "p",
    "replication_num" = "3"
);
Dynamic partitioning with a -3 start value means partitions older than three years are automatically dropped — GDPR storage limitation enforced at the partition level. Dropping a partition in Doris is an immediate operation that removes all data in that partition without per-row overhead.
Doris supports direct connection from Apache Superset, Metabase, Grafana, and standard JDBC/ODBC clients — covering the BI visualization use cases that organizations typically address with Redshift and QuickSight together.
DuckDB (Lightweight Analytics for Smaller Deployments)
For organizations whose Redshift use is primarily data exploration and ad-hoc analytics at scales below 1TB, DuckDB provides a zero-infrastructure alternative.
import duckdb
import pandas as pd

# DuckDB analytics on Parquet files in EU-sovereign S3-compatible storage
conn = duckdb.connect()

# Register EU-sovereign MinIO as S3 source
conn.execute("""
    INSTALL httpfs;
    LOAD httpfs;
    SET s3_endpoint='minio.eu.internal:9000';
    SET s3_access_key_id='...';
    SET s3_secret_access_key='...';
    SET s3_use_ssl=true;
    SET s3_url_style='path';
""")

# Query Parquet files directly, no Redshift needed
result = conn.execute("""
    SELECT
        c.country,
        DATE_TRUNC('month', o.order_date) AS month,
        COUNT(*) AS order_count,
        SUM(o.total_eur) AS revenue
    FROM read_parquet('s3://eu-datalake/customers/*.parquet') c
    JOIN read_parquet('s3://eu-datalake/orders/*.parquet') o
        ON c.customer_id = o.customer_id
    WHERE o.order_date >= CURRENT_DATE - INTERVAL '12 months'
    GROUP BY 1, 2
    ORDER BY 1, 2
""").df()
DuckDB runs in-process on EU-sovereign compute, reading Parquet files directly from EU-sovereign object storage. There is no managed service, no control plane, and no US-jurisdiction data path. For analytics teams that primarily run reports and exploratory queries, DuckDB eliminates the Redshift dependency entirely.
Migration Strategy: Redshift to EU-Sovereign Data Warehouse
Phase 1: Inventory Personal Data in Redshift
Before migration, identify every table containing personal data:
import boto3

redshift_data = boto3.client('redshift-data', region_name='eu-central-1')

# Query the Redshift information schema for column names that suggest personal data.
# Redshift does not support ILIKE ANY(ARRAY[...]), so the patterns are expanded;
# results can be grouped per table client-side.
response = redshift_data.execute_statement(
    ClusterIdentifier='analytics-cluster',
    Database='production',
    DbUser='admin',
    Sql="""
        SELECT
            table_schema,
            table_name,
            column_name
        FROM information_schema.columns
        WHERE column_name ILIKE '%email%'
           OR column_name ILIKE '%phone%'
           OR column_name ILIKE '%name%'
           OR column_name ILIKE '%address%'
           OR column_name ILIKE '%birth%'
           OR column_name ILIKE '%ip_address%'
           OR column_name ILIKE '%user_id%'
           OR column_name ILIKE '%customer_id%'
           OR column_name ILIKE '%ssn%'
           OR column_name ILIKE '%passport%'
           OR column_name ILIKE '%license%'
           OR column_name ILIKE '%iban%'
        ORDER BY table_schema, table_name, column_name
    """
)
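The same name-pattern heuristic can be applied offline to a schema dump, which makes the classification logic easy to unit test before running it against a live cluster. A minimal sketch; the patterns mirror the query above, and matching on names alone will produce both false positives and misses, so results need human review:

```python
# Column-name patterns suggesting personal data, mirroring the inventory query.
PII_PATTERNS = [
    "email", "phone", "name", "address", "birth",
    "ip_address", "user_id", "customer_id", "ssn", "passport", "license", "iban",
]

def classify_columns(columns):
    """Return the subset of column names matching any PII pattern (case-insensitive)."""
    return sorted(
        c for c in columns
        if any(p in c.lower() for p in PII_PATTERNS)
    )

# Illustrative column inventory, e.g. parsed from a schema dump
columns = ["order_id", "shipping_address", "Email", "total_eur", "date_of_birth"]
print(classify_columns(columns))
```

Keeping the pattern list in one place means the SQL inventory query and any local tooling stay in sync as new pattern candidates are added.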
Phase 2: Implement Storage Limitation Before Migration
Redshift clusters often contain years of data with no expiry policy. Before migrating the data to an EU-sovereign warehouse, implement retention limits:
-- Delete behavioral data older than retention policy (run before migration)
DELETE FROM fact_events
WHERE event_timestamp < DATEADD(year, -2, GETDATE());
DELETE FROM fact_page_views
WHERE page_timestamp < DATEADD(year, -2, GETDATE());
-- Keep order history for 5 years (financial obligation basis)
DELETE FROM fact_orders
WHERE order_date < DATEADD(year, -5, GETDATE());
VACUUM DELETE ONLY fact_events;
VACUUM DELETE ONLY fact_page_views;
VACUUM DELETE ONLY fact_orders;
Running retention cleanup before migration reduces the data volume to be transferred and ensures that only data with a current legal basis exists in the new EU-sovereign warehouse.
Phase 3: Export and Transfer to EU-Sovereign Storage
# Export from Redshift to Parquet (EU-sovereign S3 bucket)
aws redshift-data execute-statement \
--cluster-identifier analytics-cluster \
--database production \
--db-user admin \
--sql "UNLOAD ('SELECT * FROM fact_orders WHERE order_date >= DATEADD(year,-5,GETDATE())')
TO 's3://eu-sovereign-migration/orders/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-export'
FORMAT AS PARQUET
ALLOWOVERWRITE;"
# Transfer from AWS S3 to EU-sovereign MinIO
mc mirror s3/eu-sovereign-migration/ \
eu-minio/warehouse-migration/ \
--region eu-central-1
Phase 4: Import to ClickHouse
# Load Parquet files into ClickHouse from EU-sovereign storage
clickhouse-client --query "
INSERT INTO fact_orders
SELECT * FROM s3(
'http://minio.eu.internal:9000/warehouse-migration/orders/*.parquet',
'access_key',
'secret_key',
'Parquet'
)
"
GDPR Compliance Checklist for Data Warehouse Migration
Before decommissioning Redshift:
Data Inventory and Classification:
- All Redshift tables inventoried with personal data classification
- Retention periods documented for each table and justified against legal basis
- Redshift ML models identified and personal data training inputs documented
- Redshift Spectrum external tables mapped to underlying S3 data lake content
Storage Limitation:
- Retention cleanup executed — rows older than policy deleted from Redshift before migration
- TTL or partition-based automatic deletion configured in replacement warehouse
- Backup and snapshot retention reviewed (Redshift automated snapshots may contain deleted personal data)
Erasure Support:
- Erasure request pipeline updated to include replacement warehouse
- Delete performance benchmarked for replacement warehouse at expected request volume
- Customer ID lookup mechanism available in replacement warehouse for erasure requests
Access Control:
- Column-level security implemented for sensitive data categories (payment method, health attributes)
- Row-level security implemented for GDPR-sensitive analyst access patterns
- Query logging enabled in replacement warehouse for Article 30 audit trail
Post-Migration:
- Redshift cluster deleted after data validation in replacement warehouse
- Redshift automated snapshots deleted (they contain the same personal data as the cluster)
- Article 30 records updated to reflect new processing infrastructure
The Data Warehouse as GDPR Liability
Data warehouses occupy a unique position in the GDPR compliance landscape. They are the one system where personal data from every other system converges: transactional systems, behavioral analytics, support history, marketing interactions. They hold this data for years. They are queryable by any authorized analyst.
For organizations that have addressed GDPR compliance in operational systems — consent management at the frontend, DPA agreements with database providers, erasure pipelines in production databases — the data warehouse is often the last system to receive compliance attention and the one that contains the most historical personal data exposure.
Migrating a Redshift deployment to EU-sovereign infrastructure is not only about resolving the CLOUD Act jurisdictional risk. It is about gaining the ability to enforce storage limitation and erasure at the storage engine level, rather than as a manual operational process on top of infrastructure that charges by compute time for the DELETE operations that GDPR requires.
sota.io provides managed deployment of ClickHouse and other analytical databases on EU-sovereign compute, with automatic scaling and no US-jurisdiction cloud services in the data path. Your historical data stays in EU infrastructure under a single GDPR DPA.
This post is part of the sota.io AWS EU Alternative Series. Related posts: AWS Glue EU Alternative, AWS Kinesis EU Alternative, AWS SageMaker EU Alternative, AWS DynamoDB EU Alternative.
EU-Native Hosting
Ready to move to EU-sovereign infrastructure?
sota.io is a German-hosted PaaS — no CLOUD Act exposure, no US jurisdiction, full GDPR compliance by design. Deploy your first app in minutes.