2026-04-30 · 14 min read

AWS Redshift EU Alternative 2026: Data Warehouse, GDPR Compliance, and CLOUD Act Risk

Post #721 in the sota.io EU Compliance Series

AWS Redshift is Amazon's managed cloud data warehouse. Organizations use it to store and analyze years of transactional history, behavioral data, product analytics, and customer records at petabyte scale. Redshift clusters hold the most complete picture of user behavior available in any organization's infrastructure: every purchase, every session, every support interaction, aggregated into a queryable historical record.

That record exists in AWS infrastructure subject to US CLOUD Act jurisdiction. A valid US government order served on Amazon can compel disclosure of a Redshift cluster's contents — years of customer transaction history, behavioral analytics, and user profiles — without requiring the data controller to be notified in advance.

For organizations subject to GDPR, a data warehouse presents a specific compliance challenge that differs from transactional databases or streaming infrastructure. A data warehouse is designed for retention: data flows in from operational systems, is transformed and aggregated, and is kept for months or years to support historical analysis. The purpose of a data warehouse is to accumulate personal data over time. That accumulation creates GDPR obligations around storage limitation, purpose limitation, and the right to erasure that are significantly harder to fulfill when the storage layer is a US-jurisdiction service.

What AWS Redshift Stores

Redshift's role in the data architecture determines what personal data it holds and under what GDPR obligations.

Transactional History and Customer Records

The most common Redshift use case for consumer-facing organizations is an analytics layer over operational databases. ETL pipelines (often AWS Glue) extract records from RDS or DynamoDB and load them into Redshift fact and dimension tables.

A typical customer analytics schema:

-- Dimension table: customer records
CREATE TABLE dim_customers (
    customer_id     BIGINT SORTKEY,
    email           VARCHAR(255),
    first_name      VARCHAR(100),
    last_name       VARCHAR(100),
    date_of_birth   DATE,
    phone           VARCHAR(20),
    country         CHAR(2),
    city            VARCHAR(100),
    postal_code     VARCHAR(20),
    created_at      TIMESTAMP,
    last_login      TIMESTAMP,
    account_status  VARCHAR(20),
    consent_email   BOOLEAN,
    consent_tracking BOOLEAN
) DISTSTYLE KEY DISTKEY(customer_id);

-- Fact table: order history
CREATE TABLE fact_orders (
    order_id        BIGINT SORTKEY,
    customer_id     BIGINT,
    order_date      TIMESTAMP,
    total_eur       DECIMAL(10,2),
    payment_method  VARCHAR(50),
    shipping_street VARCHAR(255),
    shipping_city   VARCHAR(100),
    shipping_postal VARCHAR(20),
    shipping_country CHAR(2),
    items_count     INTEGER,
    status          VARCHAR(20)
) DISTSTYLE KEY DISTKEY(customer_id);

-- Fact table: behavioral events
CREATE TABLE fact_events (
    event_id        BIGINT,
    customer_id     BIGINT,
    session_id      VARCHAR(64),
    event_type      VARCHAR(50),
    page_url        VARCHAR(500),
    ip_address      VARCHAR(45),
    user_agent      TEXT,
    device_type     VARCHAR(20),
    referrer        VARCHAR(500),
    event_timestamp TIMESTAMP SORTKEY
) DISTSTYLE KEY DISTKEY(customer_id);

A Redshift schema like this holds: full customer identity (name, email, address, phone, date of birth), complete order history with shipping addresses and payment method references, and full behavioral event logs including IP addresses, user agents, and browsing patterns. Every field individually qualifies as personal data under GDPR Article 4.

In a typical e-commerce deployment, a Redshift cluster retains this data for three to five years for business analytics, trend analysis, and machine learning training. Under GDPR Article 5(1)(e), personal data must be kept for no longer than is necessary for the purposes for which it is processed. Retaining customer behavioral data for five years in a data warehouse requires a documented legal basis and a storage limitation policy — which many Redshift deployments lack.

Redshift Spectrum and the Federated Data Problem

Redshift Spectrum extends Redshift queries to data stored in S3, allowing analysts to query raw log files and data lake content without loading it into Redshift tables.

-- Spectrum external table over raw S3 access logs
-- (Spectrum external tables must live in an external schema, created
--  beforehand with CREATE EXTERNAL SCHEMA; "spectrum" is used here)
-- A query against it touches personal data in multiple jurisdictions
CREATE EXTERNAL TABLE spectrum.raw_logs (
    request_time    TIMESTAMP,
    user_id         BIGINT,
    ip_address      VARCHAR(45),
    request_path    VARCHAR(500),
    response_code   INTEGER,
    bytes_sent      INTEGER,
    user_agent      VARCHAR(1000),
    referer         VARCHAR(500)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://access-logs-prod/2026/';

-- Analyst query: session analysis combining warehouse + raw logs
SELECT
    c.email,
    c.country,
    COUNT(DISTINCT l.ip_address) AS ip_count,
    COUNT(*) AS request_count,
    MIN(l.request_time) AS first_seen,
    MAX(l.request_time) AS last_seen
FROM spectrum.raw_logs l
JOIN dim_customers c ON c.customer_id = l.user_id
WHERE l.request_time >= DATEADD(day, -90, GETDATE())
GROUP BY c.email, c.country
ORDER BY request_count DESC;

Spectrum queries execute in Redshift's compute layer but read data from S3. For GDPR purposes, this means a single Spectrum query can touch personal data in both Redshift (the processed, structured warehouse) and S3 (raw log files or data lake content) — both under US CLOUD Act jurisdiction — in a single execution.

Spectrum also enables cross-account data sharing: an organization can grant Redshift Spectrum access to S3 buckets in other AWS accounts. In multi-entity corporate structures, this creates processing relationships that must be evaluated under GDPR Article 28: the entity running Spectrum queries acts as a processor of personal data held in the other account, which typically requires a data processing agreement.

Redshift ML and the SageMaker Layer

Redshift ML integrates Amazon SageMaker to train machine learning models directly on Redshift data:

-- Train a customer churn prediction model on Redshift data
CREATE MODEL customer_churn_model
FROM (
    SELECT
        c.customer_id,
        c.country,
        c.date_of_birth,
        DATEDIFF(day, c.created_at, GETDATE()) AS account_age_days,
        DATEDIFF(day, c.last_login, GETDATE()) AS days_since_login,
        COALESCE(o.total_orders, 0) AS total_orders,
        COALESCE(o.total_spent, 0) AS total_spent,
        o.last_order_date,
        COALESCE(e.total_events, 0) AS total_events,
        CASE WHEN c.account_status = 'churned' THEN 1 ELSE 0 END AS churned
    FROM dim_customers c
    -- Pre-aggregate each fact table per customer: joining both fact
    -- tables directly would fan out rows and inflate counts and sums
    LEFT JOIN (
        SELECT customer_id,
               COUNT(order_id) AS total_orders,
               SUM(total_eur)  AS total_spent,
               MAX(order_date) AS last_order_date
        FROM fact_orders
        GROUP BY customer_id
    ) o ON o.customer_id = c.customer_id
    LEFT JOIN (
        SELECT customer_id, COUNT(event_id) AS total_events
        FROM fact_events
        GROUP BY customer_id
    ) e ON e.customer_id = c.customer_id
)
TARGET churned
FUNCTION predict_churn
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-ml-role'
SETTINGS (S3_BUCKET 'ml-model-artifacts-prod');

This SQL statement creates a machine learning model trained on personal data — customer age, purchase history, behavioral event counts, account status. The training process extracts data from Redshift, transfers it to AWS SageMaker for model training, and stores the resulting model artifacts in S3.

Under GDPR, using personal data to train a machine learning model is a processing activity that requires a legal basis and must be documented in Article 30 records. The model itself — encoding statistical patterns derived from individual customer behavior — may constitute personal data if it can be used to make decisions about or identify individuals. Redshift ML adds SageMaker as a second US-jurisdiction service processing the same personal data as Redshift.

Redshift Serverless and the Jurisdiction Gap

Redshift Serverless removes the cluster management requirement, providing on-demand data warehouse capacity without pre-provisioned nodes.

import boto3

# Redshift Serverless query execution
client = boto3.client('redshift-data', region_name='eu-central-1')

# Execute query against Serverless workgroup
response = client.execute_statement(
    WorkgroupName='analytics-workgroup',
    Database='production',
    Sql="""
        SELECT
            customer_id,
            email,
            SUM(total_eur) AS lifetime_value,
            COUNT(order_id) AS order_count,
            MAX(order_date) AS last_purchase
        FROM fact_orders
        JOIN dim_customers USING (customer_id)
        WHERE country = 'DE'
        GROUP BY 1, 2
        HAVING SUM(total_eur) > 500
        ORDER BY lifetime_value DESC
    """
)
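
One operational detail worth noting: the Data API is asynchronous. execute_statement returns a statement ID immediately, and results must be polled for and fetched with separate calls; a minimal sketch:

import time

# execute_statement returns an Id immediately; the query runs in the background
statement_id = response['Id']

# Poll until the statement reaches a terminal state
while True:
    status = client.describe_statement(Id=statement_id)
    if status['Status'] in ('FINISHED', 'FAILED', 'ABORTED'):
        break
    time.sleep(2)

# Each record is a list of typed values, e.g. {'stringValue': '...'} or {'longValue': 42}
if status['Status'] == 'FINISHED':
    rows = client.get_statement_result(Id=statement_id)['Records']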

Redshift Serverless abstracts the underlying infrastructure, making it even less transparent than standard Redshift clusters where the provisioned node count and instance types are visible. From a GDPR audit perspective, an organization using Redshift Serverless cannot verify the specific infrastructure their personal data runs on, the number of nodes that processed a query, or the geographic distribution of query execution — the Serverless layer manages all of this opaquely.

The IAM permissions required to query Redshift Serverless through the Data API — redshift-data:ExecuteStatement together with redshift-serverless:GetCredentials — grant the ability to run arbitrary SQL against all data in the workgroup. In organizations where these policies are broadly assigned to analytics users, this creates an access control surface that is difficult to audit for GDPR data minimization compliance.

The Data Warehouse GDPR Risk Profile

Data warehouses create GDPR risks that differ structurally from operational databases or streaming systems.

Storage Limitation at Petabyte Scale

GDPR Article 5(1)(e) requires personal data to be stored for no longer than necessary. In operational databases, records are typically deleted when they are no longer needed for the transaction (a customer record is retained while the customer account is active). Data warehouses invert this: data is never deleted from the warehouse because historical completeness is the point.

A Redshift cluster that has been running for three years may contain:

- behavioral events and session logs for users who have since requested erasure
- complete records for customer accounts that were closed years ago
- IP addresses, device identifiers, and browsing histories far past any documented retention period
- order and shipping data retained long after its original analytical purpose was served

The data warehouse is the graveyard of personal data that should have been deleted from operational systems but was never purged from analytics infrastructure. Under GDPR, the right to erasure (Article 17) applies to all copies of personal data, including copies in data warehouses and data lakes. Organizations that have fulfilled erasure requests in production databases but not in Redshift are non-compliant with every such request.

Purpose Limitation and Analytics Scope Creep

GDPR Article 5(1)(b) requires personal data to be collected for specified, explicit, and legitimate purposes. Data warehouses enable any analyst to run any query against all available data — a structural challenge for purpose limitation.

-- Query that likely exceeds the original purpose of data collection
-- Marketing analyst exploring health-correlated purchase patterns
-- (assumes an order-line grain: fact_orders carrying a product_id,
--  plus a dim_products dimension, neither shown in the schema above)
SELECT
    FLOOR(DATEDIFF(year, date_of_birth, GETDATE()) / 10) * 10 AS age_decade,
    product_category,
    AVG(total_eur) AS avg_spend,
    COUNT(*) AS purchase_count
FROM fact_orders
JOIN dim_customers USING (customer_id)
JOIN dim_products USING (product_id)
WHERE product_category IN ('vitamins', 'supplements', 'medical-devices', 'home-health')
GROUP BY 1, 2
ORDER BY avg_spend DESC;

A marketing analyst running this query is correlating customer age with purchases in health-adjacent product categories. Whether health product purchases constitute health data under GDPR Article 9 is a legal question. But the structural point stands: a data warehouse where analysts can freely join any dimension against any fact table makes purpose limitation effectively unenforceable.

The GDPR compliance documentation for a Redshift deployment often consists of a vague statement that data is used "for analytics and business improvement." A purpose description that covers every query an analyst could run effectively describes no specific purpose at all.

The Right to Erasure at Warehouse Scale

Implementing the right to erasure in a columnar data warehouse like Redshift is operationally expensive:

-- Right to erasure implementation for a single customer in Redshift
-- Must be run across all tables containing personal data
BEGIN;

-- Delete from fact tables (expensive full scans without customer_id sort key)
DELETE FROM fact_events WHERE customer_id = 7842913;
DELETE FROM fact_orders WHERE customer_id = 7842913;
DELETE FROM fact_page_views WHERE customer_id = 7842913;
DELETE FROM fact_support_tickets WHERE customer_id = 7842913;
DELETE FROM fact_email_interactions WHERE customer_id = 7842913;

-- Update dimension table (cannot delete without breaking referential integrity)
UPDATE dim_customers
SET
    email = 'deleted@erasure.invalid',
    first_name = 'DELETED',
    last_name = 'DELETED',
    phone = NULL,
    date_of_birth = NULL,
    postal_code = NULL,
    city = NULL
WHERE customer_id = 7842913;

COMMIT;

-- Reclaim disk space (required to physically remove data from storage)
VACUUM DELETE ONLY fact_events;
VACUUM DELETE ONLY fact_orders;
-- ... one VACUUM per table

Redshift's columnar storage means DELETE operations mark rows as deleted but do not immediately reclaim disk space. The VACUUM command reclaims space but is computationally expensive and cannot run concurrently with large query loads. In a large Redshift cluster under continuous analytical query load, running VACUUM DELETE ONLY across a 10TB table to fulfill a single erasure request is operationally impractical.

Organizations that have deployed Redshift without designing erasure support into their ETL pipelines and table schemas frequently discover this problem when processing GDPR erasure requests: they can mark records as deleted in the production database, but physically removing the corresponding rows from Redshift at scale requires a pipeline that was never built.
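
One common mitigation, sketched below with a hypothetical erasure_queue staging table, is to batch erasure requests: accumulate pending customer IDs and fulfill them in a scheduled weekly pass, so a single VACUUM is amortized across many requests. GDPR Article 12(3) allows up to one month to act on a request, which leaves room for this kind of batching.

-- Hypothetical batch-erasure pattern: queue requests, fulfill weekly in one pass
CREATE TABLE erasure_queue (
    customer_id  BIGINT,
    requested_at TIMESTAMP
);

-- Scheduled weekly job: one DELETE and one VACUUM per table covers all pending requests
DELETE FROM fact_events USING erasure_queue q
WHERE fact_events.customer_id = q.customer_id;

DELETE FROM fact_orders USING erasure_queue q
WHERE fact_orders.customer_id = q.customer_id;

VACUUM DELETE ONLY fact_events;
VACUUM DELETE ONLY fact_orders;

-- Clear the queue once all tables have been processed
TRUNCATE erasure_queue;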

EU-Sovereign Alternatives for Data Warehousing

ClickHouse (Self-Hosted, EU-Sovereign)

ClickHouse is one of the highest-performance open-source column-oriented databases for analytical queries. It matches or exceeds Redshift query performance on many common analytics workloads and can be self-hosted on EU-sovereign infrastructure.

-- ClickHouse equivalent of Redshift schema
CREATE TABLE dim_customers (
    customer_id     UInt64,
    email           String,
    first_name      String,
    last_name       String,
    date_of_birth   Date,
    phone           Nullable(String),
    country         FixedString(2),
    city            String,
    postal_code     String,
    created_at      DateTime,
    last_login      DateTime,
    account_status  LowCardinality(String),
    consent_email   UInt8,
    consent_tracking UInt8
) ENGINE = ReplacingMergeTree()
ORDER BY customer_id
SETTINGS storage_policy = 'eu_sovereign_storage';

CREATE TABLE fact_orders (
    order_id        UInt64,
    customer_id     UInt64,
    order_date      DateTime,
    total_eur       Decimal(10,2),
    payment_method  LowCardinality(String),
    shipping_city   String,
    shipping_country FixedString(2),
    items_count     UInt16,
    status          LowCardinality(String)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(order_date)
ORDER BY (customer_id, order_date)
TTL order_date + INTERVAL 3 YEAR;

The TTL order_date + INTERVAL 3 YEAR clause implements GDPR storage limitation directly in the table definition: rows older than three years are automatically deleted. ClickHouse's TTL mechanism enforces storage limitation at the storage engine level without requiring manual cleanup processes — a structural advantage over Redshift for GDPR compliance.
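
For a table created without a TTL, retention can be retrofitted in place; a brief sketch, assuming an event table with an event_timestamp column:

-- Add or change a retention policy on an existing table
ALTER TABLE fact_events MODIFY TTL event_timestamp + INTERVAL 2 YEAR;

-- Optionally force expired rows out now instead of waiting for the next merge
OPTIMIZE TABLE fact_events FINAL;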

ClickHouse deletes do not require a VACUUM step; mutations rewrite the affected data parts in the background:

-- GDPR erasure in ClickHouse: one statement per table, no VACUUM pass
ALTER TABLE dim_customers DELETE WHERE customer_id = 7842913;
ALTER TABLE fact_orders DELETE WHERE customer_id = 7842913;
ALTER TABLE fact_events DELETE WHERE customer_id = 7842913;

ClickHouse mutations execute asynchronously and typically complete within minutes on warehouse-scale tables, without the VACUUM overhead that makes Redshift erasure operationally impractical.
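
Because mutations apply in the background, erasure fulfillment can be verified, and documented for audit purposes, by querying the system.mutations table:

-- Pending mutations: an empty result means every erasure delete has fully applied
SELECT table, command, create_time, parts_to_do
FROM system.mutations
WHERE is_done = 0;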

Deployed on EU-sovereign infrastructure via sota.io, ClickHouse provides a complete Redshift equivalent with all data remaining in EU jurisdiction under a single GDPR DPA.

Apache Doris (Self-Hosted, Real-Time Analytics)

Apache Doris is an open-source OLAP database with MySQL protocol compatibility, making it simpler to integrate with existing BI tooling than ClickHouse.

-- Apache Doris table with GDPR-aligned partitioning
CREATE TABLE fact_events (
    event_date      DATE NOT NULL,
    event_id        BIGINT NOT NULL,
    customer_id     BIGINT,
    session_id      VARCHAR(64),
    event_type      VARCHAR(50),
    page_url        VARCHAR(500),
    ip_address      VARCHAR(45),
    event_timestamp DATETIME
)
DUPLICATE KEY(event_date, event_id)
PARTITION BY RANGE(event_date)(
    PARTITION p2024 VALUES LESS THAN ('2025-01-01'),
    PARTITION p2025 VALUES LESS THAN ('2026-01-01'),
    PARTITION p2026 VALUES LESS THAN ('2027-01-01')
)
DISTRIBUTED BY HASH(customer_id) BUCKETS 32
PROPERTIES (
    "dynamic_partition.enable" = "true",
    "dynamic_partition.time_unit" = "YEAR",
    "dynamic_partition.start" = "-3",
    "dynamic_partition.end" = "1",
    "dynamic_partition.prefix" = "p",
    "replication_num" = "3"
);

Dynamic partitioning with a -3 start value means partitions older than three years are automatically dropped — GDPR storage limitation enforced at the partition level. Dropping a partition in Doris is an immediate operation that removes all data in that partition without per-row overhead.
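
Partitions can also be dropped manually, which removes every row in the partition as a near-instant metadata operation:

-- Drop a full year of behavioral events in one statement
ALTER TABLE fact_events DROP PARTITION p2024;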

Doris supports direct connection from Apache Superset, Metabase, Grafana, and standard JDBC/ODBC clients — covering the BI visualization use cases that organizations typically address with Redshift and QuickSight together.
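
A quick illustration of that compatibility, with a placeholder hostname (9030 is the Doris frontend's default MySQL-protocol port):

# Any stock MySQL client can query Doris directly
mysql -h doris-fe.eu.internal -P 9030 -u analyst -p \
    -e "SELECT event_type, COUNT(*) FROM fact_events GROUP BY event_type;"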

DuckDB (Lightweight Analytics for Smaller Deployments)

For organizations whose Redshift use is primarily data exploration and ad-hoc analytics at scales below 1TB, DuckDB provides a zero-infrastructure alternative.

import duckdb
import pandas as pd

# DuckDB analytics on Parquet files in EU-sovereign S3-compatible storage
conn = duckdb.connect()

# Register EU-sovereign MinIO as S3 source
conn.execute("""
    INSTALL httpfs;
    LOAD httpfs;
    SET s3_endpoint='minio.eu.internal:9000';
    SET s3_access_key_id='...';
    SET s3_secret_access_key='...';
    SET s3_use_ssl=true;
    SET s3_url_style='path';
""")

# Query Parquet files directly — no Redshift needed
result = conn.execute("""
    SELECT
        c.country,
        DATE_TRUNC('month', o.order_date) AS month,
        COUNT(*) AS order_count,
        SUM(o.total_eur) AS revenue
    FROM read_parquet('s3://eu-datalake/customers/*.parquet') c
    JOIN read_parquet('s3://eu-datalake/orders/*.parquet') o
        ON c.customer_id = o.customer_id
    WHERE o.order_date >= CURRENT_DATE - INTERVAL '12 months'
    GROUP BY 1, 2
    ORDER BY 1, 2
""").df()

DuckDB runs in-process on EU-sovereign compute, reading Parquet files directly from EU-sovereign object storage. There is no managed service, no control plane, and no US-jurisdiction data path. For analytics teams that primarily run reports and exploratory queries, DuckDB eliminates the Redshift dependency entirely.

Migration Strategy: Redshift to EU-Sovereign Data Warehouse

Phase 1: Inventory Personal Data in Redshift

Before migration, identify every table containing personal data:

import boto3

redshift_data = boto3.client('redshift-data', region_name='eu-central-1')

# Query the information schema for columns whose names suggest personal data.
# (LISTAGG/STRING_AGG can't be used here: catalog queries run on the leader
#  node, which doesn't support compute-node aggregate functions.)
response = redshift_data.execute_statement(
    ClusterIdentifier='analytics-cluster',
    Database='production',
    DbUser='admin',
    Sql="""
        SELECT
            table_schema,
            table_name,
            column_name
        FROM information_schema.columns
        WHERE column_name ILIKE '%email%'
           OR column_name ILIKE '%phone%'
           OR column_name ILIKE '%name%'
           OR column_name ILIKE '%address%'
           OR column_name ILIKE '%birth%'
           OR column_name ILIKE '%ip_address%'
           OR column_name ILIKE '%user_id%'
           OR column_name ILIKE '%customer_id%'
           OR column_name ILIKE '%ssn%'
           OR column_name ILIKE '%passport%'
           OR column_name ILIKE '%license%'
           OR column_name ILIKE '%iban%'
        ORDER BY table_schema, table_name
    """
)

Phase 2: Implement Storage Limitation Before Migration

Redshift clusters often contain years of data with no expiry policy. Before migrating the data to an EU-sovereign warehouse, implement retention limits:

-- Delete behavioral data older than retention policy (run before migration)
DELETE FROM fact_events
WHERE event_timestamp < DATEADD(year, -2, GETDATE());

DELETE FROM fact_page_views
WHERE page_timestamp < DATEADD(year, -2, GETDATE());

-- Keep order history for 5 years (financial obligation basis)
DELETE FROM fact_orders
WHERE order_date < DATEADD(year, -5, GETDATE());

VACUUM DELETE ONLY fact_events;
VACUUM DELETE ONLY fact_page_views;
VACUUM DELETE ONLY fact_orders;

Running retention cleanup before migration reduces the data volume to be transferred and ensures that only data with a current legal basis exists in the new EU-sovereign warehouse.

Phase 3: Export and Transfer to EU-Sovereign Storage

# Export from Redshift to Parquet (EU-sovereign S3 bucket)
aws redshift-data execute-statement \
    --cluster-identifier analytics-cluster \
    --database production \
    --db-user admin \
    --sql "UNLOAD ('SELECT * FROM fact_orders WHERE order_date >= DATEADD(year,-5,GETDATE())')
           TO 's3://eu-sovereign-migration/orders/'
           IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-export'
           FORMAT AS PARQUET
           ALLOWOVERWRITE;"

# Transfer from AWS S3 to EU-sovereign MinIO
# ("s3" and "eu-minio" are aliases configured beforehand with `mc alias set`)
mc mirror s3/eu-sovereign-migration/ eu-minio/warehouse-migration/

Phase 4: Import to ClickHouse

# Load Parquet files into ClickHouse from EU-sovereign storage
clickhouse-client --query "
INSERT INTO fact_orders
SELECT * FROM s3(
    'http://minio.eu.internal:9000/warehouse-migration/orders/*.parquet',
    'access_key',
    'secret_key',
    'Parquet'
)
"

GDPR Compliance Checklist for Data Warehouse Migration

Before decommissioning Redshift:

Data Inventory and Classification:

- Every table and column containing personal data is documented in Article 30 processing records
- Spectrum external tables and the S3 locations they point at are included in the inventory
- Redshift ML training sets and model artifacts derived from personal data are identified

Storage Limitation:

- Every table has a retention period with a documented legal basis
- Retention is enforced at the engine level in the target warehouse (ClickHouse TTL or Doris dynamic partitions)

Erasure Support:

- A tested erasure procedure covers every table and external location holding personal data
- Erasure requests previously fulfilled only in operational databases have been replayed against warehouse data

Access Control:

- Analyst access is scoped to the tables and columns needed for documented purposes
- Query activity against personal data is logged and auditable

Post-Migration:

- The Redshift cluster and all automated and manual snapshots are deleted
- The S3 staging bucket used for the UNLOAD export is emptied and removed
- Article 30 records and DPAs are updated to name the EU-sovereign processor

The Data Warehouse as GDPR Liability

Data warehouses occupy a unique position in the GDPR compliance landscape. They are the one system where personal data from every other system converges: transactional systems, behavioral analytics, support history, marketing interactions. They hold this data for years. They are queryable by any authorized analyst.

For organizations that have addressed GDPR compliance in operational systems — consent management at the frontend, DPA agreements with database providers, erasure pipelines in production databases — the data warehouse is often the last system to receive compliance attention and the one that contains the most historical personal data exposure.

Migrating a Redshift deployment to EU-sovereign infrastructure is not only about resolving the CLOUD Act jurisdictional risk. It is about gaining the ability to enforce storage limitation and erasure at the storage engine level, rather than as a manual operational process on top of infrastructure that charges by compute time for the DELETE operations that GDPR requires.

sota.io provides managed deployment of ClickHouse and other analytical databases on EU-sovereign compute, with automatic scaling and no US-jurisdiction cloud services in the data path. Your historical data stays in EU infrastructure under a single GDPR DPA.

This post is part of the sota.io AWS EU Alternative Series. Related posts: AWS Glue EU Alternative, AWS Kinesis EU Alternative, AWS SageMaker EU Alternative, AWS DynamoDB EU Alternative.

EU-Native Hosting

Ready to move to EU-sovereign infrastructure?

sota.io is a German-hosted PaaS — no CLOUD Act exposure, no US jurisdiction, full GDPR compliance by design. Deploy your first app in minutes.