2026-05-26 · sota.io Team

EU Data Lakehouse Comparison Finale 2026: Databricks vs Snowflake vs dbt Cloud vs Starburst

Post #1302 in the sota.io EU Cyber Compliance Series — EU-DATA-LAKEHOUSE-TOOLS #5/5 COMPLETE

EU Data Lakehouse Comparison 2026 — CLOUD Act Risk Matrix all four vendors

Your data lakehouse is the most architecturally intimate software in your stack. It knows your schema. It knows your query patterns. It knows which analysts access which tables and when. It stores the lineage of every transformation from raw data to the dashboards your executives review. It holds the model weights trained on your customer data and the semantic definitions that describe what your key business metrics mean.

It is also, in the case of the four platforms in this series, a Delaware C-Corporation under US CLOUD Act jurisdiction.

This finale closes the EU Data Lakehouse Tools series. We ran Databricks, Snowflake, dbt Cloud, and Starburst through the same five-dimension CLOUD Act exposure framework used across every series on this blog. All four score differently. All four are subject to 18 U.S.C. § 2703 in ways that EU contractual protections (SCCs, DPAs, BCRs) cannot fully resolve. And there is an EU-native sovereign stack that scores 0/5 on every dimension, for a total of 0/25.

CLOUD Act Exposure Matrix — All Four Vendors

Dimension	Databricks	Snowflake	Starburst	dbt Cloud
D1: Corporate Jurisdiction	5/5	5/5	5/5	5/5
D2: Data Routing Architecture	4/5	3/5	3/5	3/5
D3: Subprocessors	4/5	4/5	3/5	3/5
D4: Personnel Access	3/5	4/5	3/5	2/5
D5: Legal Framework	4/5	3/5	2/5	2/5
Total CLOUD Act Score	20/25	19/25	16/25	15/25

Higher score = higher CLOUD Act exposure risk. EU-native alternative stack (Apache Spark + Apache Iceberg + DuckDB + MinIO + Apache Atlas): 0/25.
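
The totals in the matrix are plain sums of the five dimension scores. As a minimal sketch (the names SCORES, total_exposure, TOTALS and RANKING are illustrative, not part of any published framework tooling), the arithmetic and the resulting ranking can be reproduced like this:

```python
# Dimension scores copied from the matrix above, in the order D1..D5 (each out of 5).
SCORES = {
    "Databricks": [5, 4, 4, 3, 4],
    "Snowflake": [5, 3, 4, 4, 3],
    "Starburst": [5, 3, 3, 3, 2],
    "dbt Cloud": [5, 3, 3, 2, 2],
}


def total_exposure(dimensions):
    """Sum five 0-5 dimension scores into a 0-25 total."""
    if len(dimensions) != 5 or any(not 0 <= d <= 5 for d in dimensions):
        raise ValueError("expected five scores in the range 0-5")
    return sum(dimensions)


TOTALS = {vendor: total_exposure(dims) for vendor, dims in SCORES.items()}

# Highest exposure first, matching the column order of the matrix.
RANKING = sorted(TOTALS, key=TOTALS.get, reverse=True)

if __name__ == "__main__":
    for vendor in RANKING:
        print(f"{vendor}: {TOTALS[vendor]}/25")
```

Keeping the per-dimension values rather than only the totals makes it easier to see that the vendors tie on D1 and diverge only on D2 through D5.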

All four vendors score D1 at 5/5. This is the inescapable baseline: every vendor in this series is incorporated as a Delaware C-Corporation, which places it squarely within reach of compelled disclosure under 18 U.S.C. § 2703, including § 2703(d) court orders. No SCC addendum, no data processing agreement, no contractual "EU data only" promise changes this.

The differentiation happens in D2 through D5 — and the differentiation matters.

Why Data Lakehouses Create a Unique CLOUD Act Risk Category

Data lakehouses occupy a structurally different position in your stack than compute services, databases, or storage. They are not passive infrastructure. They are intelligence infrastructure: they accumulate metadata about your data operations at a scale and depth that no other category of enterprise software matches.

Consider what Databricks knows about a European bank after 12 months of operation:

Every SQL query run against every table, with timestamps and user attribution (Unity Catalog access logs)
Column-level lineage from raw transaction tables through risk model feature engineering to regulatory reporting dashboards
Model registry metadata identifying which ML models were trained on which transaction data, by which teams, with which evaluation metrics
Delta Sharing configuration revealing which data products were shared with which counterparties

This is the Lakehouse Intelligence Paradox: the platform you use to process EU personal data accumulates a comprehensive architectural map of your data operations under US jurisdiction — without storing the underlying personal data. A CLOUD Act compelled disclosure targeting your lakehouse provider yields not your customer records, but something potentially more sensitive: the complete operational intelligence of how you process them.

This distinction matters for GDPR risk management in ways that the individual vendor analyses could not fully address. It requires understanding the category-level exposure before evaluating vendor-level differentiation.

The Lakehouse Intelligence Paradox — Category-Level Risk

The four meta-risks unique to the data lakehouse category:

1. The Schema Fingerprint Problem

Every data lakehouse stores detailed schema information: table names, column names, data types, column descriptions, classification labels. For organisations processing EU personal data, these schemas are not neutral metadata. A schema containing columns named health_condition, political_affiliation, sexual_orientation, salary_band, or bank_account_iban reveals the existence and structure of sensitive personal data processing, and for the first three columns special-category processing under GDPR Art.9, even when the underlying data never leaves EU servers.

EDPB Recommendations 01/2020 on supplementary transfer measures (final version adopted June 2021) establish that metadata about personal data processing can constitute personal data when it enables inferences about the data subjects described by the underlying data. A CLOUD Act subpoena targeting Unity Catalog, Snowflake's metadata service, or dbt Cloud's Discovery API yields schema fingerprints that may trigger GDPR Art.9 obligations, without a single row of actual personal data being disclosed.
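
To make the schema fingerprint concrete, here is a rough sketch of how much can be inferred from column names alone. The hint list and the function name fingerprint are assumptions made for illustration; a real classifier would need a far richer vocabulary and would also look at column descriptions and classification labels.

```python
# Hypothetical hint list: substrings in a column name that suggest sensitive processing.
SENSITIVE_HINTS = {
    "health": "health data (Art.9)",
    "political": "political opinions (Art.9)",
    "sexual_orientation": "sexual orientation (Art.9)",
    "salary": "financial data",
    "iban": "financial identifier",
}


def fingerprint(schema):
    """Return {table: [(column, category), ...]} for columns whose names match a hint.

    `schema` maps table names to lists of column names. No row data is needed.
    """
    findings = {}
    for table, columns in schema.items():
        for column in columns:
            name = column.lower()
            for hint, category in SENSITIVE_HINTS.items():
                if hint in name:
                    findings.setdefault(table, []).append((column, category))
                    break  # one category per column is enough for this sketch
    return findings


if __name__ == "__main__":
    example = {
        "patients": ["patient_id", "health_condition"],
        "payroll": ["employee_id", "salary_band", "bank_account_iban"],
    }
    for table, hits in fingerprint(example).items():
        print(table, hits)
```

The point of the sketch is that the input is pure metadata: whoever holds the catalog can run the same inference without ever touching a row.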

2. The Query Intelligence Trap

All four platforms store query history. Databricks via Unity Catalog audit logs. Snowflake via Query History (90 days, accessible to US personnel). Starburst via Galaxy coordinator query logs. dbt Cloud via run history and compiled SQL artefacts.

Query history is not neutral operational data. For a European bank: query patterns reveal which risk models ran before which credit decisions. For a European hospital: query patterns reveal which clinical research pipelines ran, on which patient cohorts, before which publications. For a European insurer: query patterns reveal the behavioral analytics logic behind premium adjustments.

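This intelligence is compellable without EU judicial authorisation under 18 U.S.C. § 2703(d). The order is issued by a US court, and the government need only offer "specific and articulable facts" showing reasonable grounds to believe that the records sought are "relevant and material to an ongoing criminal investigation", a standard lower than probable cause.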
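
As an illustration of why query history is not neutral, the following sketch derives a per-user map of referenced tables from nothing but logged SQL text. The log record shape (user and sql keys) and the function name tables_by_user are hypothetical, and the regular expression is deliberately crude: it ignores subqueries, CTEs and quoted identifiers that a real SQL parser would handle.

```python
import re
from collections import defaultdict

# Crude pattern: an identifier (optionally schema-qualified) following FROM or JOIN.
TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def tables_by_user(query_log):
    """Map each user to the sorted list of tables their logged queries reference."""
    seen = defaultdict(set)
    for entry in query_log:
        for table in TABLE_REF.findall(entry["sql"]):
            seen[entry["user"]].add(table.lower())
    return {user: sorted(tables) for user, tables in seen.items()}


if __name__ == "__main__":
    log = [
        {"user": "risk_analyst",
         "sql": "SELECT * FROM credit.applications a JOIN risk.scores s ON a.id = s.app_id"},
        {"user": "risk_analyst",
         "sql": "select count(*) from credit.applications"},
    ]
    print(tables_by_user(log))
```

Adding timestamps to the same structure would show which models and tables were touched before a given decision, which is the pattern the bank, hospital and insurer examples describe.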

3. The ML Model Boundary Problem

Databricks MLflow Model Registry and Snowflake Model Registry store trained model weights. CNIL (France, 2024) and the German DSK (October 2024 working paper) have each articulated the position that ML model weights trained on EU personal data constitute "derived personal data" under GDPR Art.4(1) when the models can be used to infer information about individual data subjects.

If this interpretation is confirmed — and the trend in EU data protection guidance is toward confirming it — then CLOUD Act compelled disclosure of model weights trained on EU personal data constitutes a cross-border data transfer subject to Chapter V GDPR. Standard Contractual Clauses do not cover US law enforcement compelled disclosure. There is no validated supplementary measure that makes CLOUD Act model weight disclosure compliant.

Neither Databricks nor Snowflake has published a legal opinion addressing this exposure. The regulatory gap exists today.

4. The DORA Concentration Risk Layer

For EU financial entities subject to DORA, data lakehouses fall under the Art.28 ICT third-party risk regime as ICT services supporting critical or important functions when they support regulatory reporting, risk management analytics, or fraud detection. This creates a second regulatory exposure layer beyond GDPR: DORA requires contractual documentation of data processing locations, audit rights, and an assessment of concentration risk.

Databricks (20/25) and Snowflake (19/25) are the most widely adopted lakehouse platforms in EU financial services. The CLOUD Act exposure of these platforms creates an unresolved intersection: DORA requires organisations to document where their critical ICT third-party services process data; CLOUD Act jurisdiction means this documentation must acknowledge US-side processing of operational metadata even for EU-region deployed instances.

Vendor Profiles — What Differentiates the Scores

Databricks — 20/25 (Highest in Series)

Databricks leads the series in CLOUD Act exposure because it leads in metadata surface area. Unity Catalog, MLflow, Delta Sharing, and the Databricks SQL Warehouse collectively accumulate more structured metadata about EU data operations than any other platform in this analysis.

Why D2: 4/5 (highest): Databricks offers EU data residency options (AWS eu-central-1, Azure northeurope/westeurope). However, Unity Catalog Control Plane infrastructure — the component that stores lineage graphs, access control policies, and metadata search indexes — runs in US-based Databricks-controlled infrastructure regardless of where your workspace is deployed. This is the highest D2 score in the series because the gap between "data in EU" and "metadata in US" is most pronounced in the Databricks architecture.

Why D5: 4/5 (highest): Databricks has invested more in EU compliance documentation than the other vendors: EU-US Data Privacy Framework registration, published sub-processor disclosures, and detailed guidance on data residency configuration. This documentation does not reduce CLOUD Act exposure — §2703(d) applies regardless — but it creates a higher standard of legal process hygiene. The D5 score of 4/5 reflects this relative investment in compliance documentation, not a reduced risk profile.

Named Risk Patterns:

Unity Catalog Lineage Fingerprint: Column-level lineage stored in US-controlled Control Plane reveals the complete feature engineering pipeline, credit model architecture, and clinical research data flows of EU deployments.
MLflow Model Registry CLOUD Act Trap: Model weights trained on EU personal data stored in US-jurisdiction Model Registry create a potential GDPR Art.4(1) cross-border transfer issue under emerging EU regulatory guidance.
Delta Sharing Protocol Cross-Border Leakage: Delta Sharing configuration and recipient metadata stored in Databricks infrastructure reveal the cross-organisational data sharing relationships of EU data providers.

Snowflake — 19/25

One point below Databricks, Snowflake's 19/25 reflects a tri-cloud architecture that creates broad subprocessor exposure (D3: 4/5) combined with the highest personnel access risk in the series (D4: 4/5).

Why D4: 4/5 (highest): Snowflake's US-based platform engineering teams have administrative-level access to EU tenant account metadata: virtual warehouse configurations, account usage views, and query history tables accessible via the SNOWFLAKE database. This is not unique to Snowflake, since all SaaS platforms have operational access requirements, but Snowflake's architecture provides less contractual isolation for EU metadata than Databricks' Unity Catalog permission boundaries.

Why D2: 3/5: The Data Cloud's global metadata service — which powers Data Marketplace, Data Sharing, and cross-account collaboration — maintains a global registry of account metadata, dataset listings, and sharing configurations that operates outside EU data residency guarantees. Even organisations that deploy exclusively in EU regions participate in this global metadata service as a precondition of account creation.

Named Risk Patterns:

Data Cloud Metadata Broker Pattern: Global Data Marketplace infrastructure stores schema descriptions and sharing configurations for EU data products under US jurisdiction.
Snowpark ML Training Artifact Trap: Feature stores and model weights trained on EU personal data stored in US-jurisdiction Model Registry.
Tri-Cloud Control Plane Exposure: Simultaneous AWS/Azure/GCP subprocessor relationships create three independent CLOUD Act exposure vectors.

Starburst Galaxy — 16/25

Starburst's 16/25 reflects a different risk profile than the storage platforms. Galaxy does not store your EU data — it federates queries across data sources that do. The CLOUD Act risk is concentrated in the query orchestration layer.

Why D5: 2/5 (lowest, shared with dbt Cloud): Starburst provides standard SCCs without the supplementary compliance documentation that Databricks and Snowflake have published. No BCR, no published CLOUD Act challenge history, no contractual commitment to notify EU customers of government access requests (beyond what GDPR Art.28 requires and US law permits).

Why the score is lower despite the query exposure: The absence of direct data storage reduces D2 and D3 scores. Galaxy processes queries against your EU data sources but does not maintain persistent copies of EU data. The CLOUD Act exposure is real — coordinator logs, query plans, result metadata — but structurally narrower than platforms that store EU data directly.

Named Risk Patterns:

Trino Query Federation Control Plane Pattern: US-jurisdiction Galaxy coordinator parses every SQL query, stores query plans, caches result metadata for EU data operations.
Ranger Policy Propagation Gap: Row-level security (RLS) policies are managed in US infrastructure, and propagation latency creates windows of potential non-compliance with GDPR Art.5(1)(f).
Iceberg REST Catalog Exposure: Table schemas, partition specs, and snapshot history for Galaxy-managed Iceberg tables stored in US-jurisdiction catalog.

dbt Cloud — 15/25 (Lowest in Series)

dbt Cloud scores lowest primarily because of the D4 gap. The product architecture requires US-based dbt Labs engineers to have structured, queryable access to SQL transformation code, compiled artefacts, and lineage metadata — not because of inadequate security controls, but because that access is how the product works.

Why D4: 2/5 (lowest): Unlike storage platforms that can implement technical data-plane isolation, dbt Cloud's product value derives from intelligence about transformation logic. The Discovery API, dbt Semantic Layer, and lineage graphs require centralised processing of SQL code that operates on EU personal data. Technical isolation mechanisms available for raw storage (VPC peering, bring-your-own-key) do not apply to transformation intelligence.

Why D5: 2/5: dbt Cloud's compliance documentation is the least developed in the series: standard DPA with SCCs Annex II, no EU-specific legal guarantees, no published sub-processor list with jurisdictions, no BCR in progress. This reflects dbt Labs' growth stage and the lower regulatory scrutiny historically applied to transformation tools compared to storage platforms.

Named Risk Patterns:

Semantic Layer Metadata Pattern: Business metric definitions over EU personal data stored in US-hosted Semantic Layer API reveal data architecture to CLOUD Act subpoena.
Transformation Lineage Audit Trail: Column-level lineage stored in Discovery API constitutes a structural fingerprint of EU data processing pipelines.
Compiled SQL Artefact Exposure: dbt Cloud stores compiled SQL artefacts — the actual transformation code applied to EU data — in US-jurisdiction infrastructure accessible to dbt Labs engineers.

The EU-Native Sovereign Lakehouse Stack — 0/25

The EU-native sovereign alternative to all four platforms is a composable open-source stack with no US corporate dependencies:

Component	EU-Native Option	Jurisdiction	CLOUD Act Score
Query Engine	Apache Spark (self-hosted)	Apache Software Foundation (501(c)(3) non-profit, US-registered but open governance)	0/25
Table Format	Apache Iceberg (self-hosted)	Apache Software Foundation	0/25
Interactive Analytics	DuckDB (CWI Amsterdam, NL)	CWI — Dutch national research institute	0/25
Object Storage	MinIO (self-hosted on Hetzner/OVH)	AGPL-3.0, self-hosted — no SaaS dependency	0/25
Metadata Catalog	Apache Atlas (self-hosted)	Apache Software Foundation	0/25
Data Transformation	dbt Core OSS (self-hosted)	Apache-2.0, self-hosted	0/25

DuckDB deserves particular attention. Created at CWI (Centrum Wiskunde & Informatica, Amsterdam, Netherlands), DuckDB is one of the few major analytical query engines with EU institutional origins. Its MIT licence, embedded architecture (no network-accessible service required), and exceptional analytical performance for medium-scale workloads make it the highest-sovereignty option for interactive lakehouse analytics. CWI is a Dutch national research institute: not a Delaware C-Corp, not a US VC-backed startup, not subject to CLOUD Act jurisdiction.

The composability advantage: Each component in the EU-native stack can be deployed independently on Hetzner Germany, OVHcloud France, or any EU-jurisdiction cloud infrastructure. No component requires calling home to US-controlled servers. Metadata — lineage, catalogs, schemas — stays entirely within your deployment environment under EU jurisdiction.

The honest trade-off: The EU-native stack requires significantly more engineering investment to operate than a managed service. Databricks and Snowflake exist because running distributed lakehouse infrastructure at scale is hard. DuckDB + MinIO + Apache Atlas is not a product — it is an engineering discipline. For organisations with strong platform engineering teams and genuine sovereignty requirements (healthcare, financial services, defence-adjacent), this trade-off is rational. For organisations that cannot staff the platform engineering overhead, accepting a managed service with documented CLOUD Act exposure and appropriate risk management processes may be the right operational decision.

Series Summary: The EU Data Lakehouse Sovereignty Spectrum

Platform	Score	Primary Risk Driver	On-Prem Option	EU Entity
Databricks	20/25	Unity Catalog Control Plane (US) + broadest metadata accumulation	Limited	No
Snowflake	19/25	Tri-cloud subprocessors + highest personnel access score	No	No
Starburst Galaxy	16/25	Query federation coordinator under US jurisdiction	Starburst Enterprise (self-hosted)	No
dbt Cloud	15/25	Transformation intelligence accessible to US engineers	dbt Core (OSS)	No
EU-native OSS stack	0/25	No US corporate dependency	Self-hosted by design	Partial (DuckDB/CWI NL)

The on-premises escape hatch is partial and expensive. Starburst Enterprise and dbt Core provide genuine on-premises alternatives that eliminate the SaaS CLOUD Act exposure. Databricks and Snowflake do not offer equivalent on-premises paths for their full platform capabilities: Databricks has no self-hosted distribution of its full platform (the free Community Edition is itself a hosted offering and lacks Unity Catalog's enterprise capabilities), and Snowflake has no on-premises option.

The score ordering is not arbitrary. The 20→19→16→15 progression reflects the volume and architectural depth of metadata that each platform accumulates about EU data operations. Databricks accumulates the most (full lineage + ML models + Delta Sharing). Snowflake accumulates broadly (Data Cloud global metadata + Snowpark). Starburst accumulates deeply but narrowly (query plans and history, plus catalog metadata for Galaxy-managed tables). dbt Cloud accumulates narrowly (transformation artefacts).

EU Data Lakehouse Procurement Framework

For EU organisations evaluating these platforms under GDPR Art.46 transfer mechanisms and DORA Art.28 third-party risk requirements:

Step 1: Classify your workload sovereignty requirements.

Not all lakehouse workloads carry equal GDPR risk. Special-category data (health, biometric, political affiliation — Art.9) under automated decision-making (Art.22) represents maximum sovereignty risk. Financial risk modelling under DORA creates a second regulatory dimension. Analytics on non-personal aggregate data carries the lowest risk.
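
One way to make this classification repeatable is to encode it as a small rule set. The sketch below is an assumption-laden illustration: the Workload fields and the tier names are invented here, and the two intermediate tiers ("high" and "standard") are one possible way to fill the space between the maximum-risk and lowest-risk cases named above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    special_category: bool      # Art.9 data involved
    automated_decisions: bool   # Art.22 automated decision-making
    dora_critical: bool         # supports DORA-relevant functions
    personal_data: bool = True  # False for non-personal aggregate analytics


def sovereignty_tier(w):
    """Classify a workload into a hypothetical four-level sovereignty tier."""
    if w.special_category and w.automated_decisions:
        return "maximum"
    if w.special_category or w.dora_critical:
        return "high"
    if w.personal_data:
        return "standard"
    return "low"


if __name__ == "__main__":
    workloads = [
        Workload("clinical triage scoring", True, True, False),
        Workload("credit risk modelling", False, True, True),
        Workload("aggregate traffic stats", False, False, False, personal_data=False),
    ]
    for w in workloads:
        print(f"{w.name}: {sovereignty_tier(w)}")
```

The ordering of the checks matters: the most restrictive condition is tested first so that a workload is never assigned a lower tier than its riskiest attribute warrants.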

Step 2: Map the metadata surface, not just the data.

Your GDPR Art.30 Records of Processing Activities must document not just where your personal data is stored, but where metadata about its processing is stored. For Databricks: Unity Catalog Control Plane (US). For Snowflake: Data Cloud metadata service (US). For Starburst: Galaxy coordinator logs (US). For dbt Cloud: Discovery API (US). This documentation is required for accountability under Art.5(2) and must be accurate.
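
A RoPA entry that captures both surfaces might be generated along these lines. The component names and jurisdictions are taken from the paragraph above; the field names and the helper ropa_entry are hypothetical and would need to be adapted to whatever RoPA template your organisation uses.

```python
# Metadata component and jurisdiction per platform, as listed in Step 2.
METADATA_SURFACE = {
    "Databricks": ("Unity Catalog Control Plane", "US"),
    "Snowflake": ("Data Cloud metadata service", "US"),
    "Starburst": ("Galaxy coordinator logs", "US"),
    "dbt Cloud": ("Discovery API", "US"),
}


def ropa_entry(platform, data_region):
    """Build a record that documents data location and metadata location separately."""
    component, metadata_jurisdiction = METADATA_SURFACE[platform]
    return {
        "platform": platform,
        "data_location": data_region,
        "metadata_component": component,
        "metadata_jurisdiction": metadata_jurisdiction,
        # Flag the gap between "data in EU" and "metadata elsewhere".
        "metadata_outside_eu": metadata_jurisdiction != "EU",
    }


if __name__ == "__main__":
    print(ropa_entry("Snowflake", "eu-central-1"))
```

Recording the two locations as separate fields keeps the "data in EU, metadata in US" gap visible instead of letting a single region field hide it.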

Step 3: Evaluate the on-premises path for high-sovereignty workloads.

Starburst Enterprise (self-hosted) and dbt Core (OSS) provide clean US-jurisdiction exits for organisations that can absorb the operational overhead. For special-category data processing or DORA-critical workloads, the engineering investment in self-hosted alternatives is typically justified by the risk reduction.

Step 4: Implement supplementary technical measures for managed services.

For organisations that adopt managed services despite CLOUD Act exposure:

Deploy EU-region instances where available (reduces D2 risk)
Implement data-plane encryption with customer-managed keys (reduces raw data exposure — does not affect metadata exposure)
Negotiate contractual notification clauses (CLOUD Act gag orders can prevent notification — negotiate what law permits)
Document residual CLOUD Act risk explicitly in DPIA (Art.35) and RoPA (Art.30) — regulators increasingly expect this documentation
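
The four measures above can be tracked as a simple checklist. This is only a sketch: the keys and the helper outstanding_measures are invented for illustration, and the descriptions paraphrase the list rather than quoting any regulatory text.

```python
# The four Step 4 measures, keyed by hypothetical short identifiers.
STEP4_MEASURES = {
    "eu_region": "Deploy EU-region instances (reduces D2 risk)",
    "customer_managed_keys": "Customer-managed keys (protects raw data, not metadata)",
    "notification_clause": "Contractual notification clause (within what law permits)",
    "documented_in_dpia_ropa": "Residual CLOUD Act risk documented in DPIA and RoPA",
}


def outstanding_measures(adopted):
    """Return descriptions of Step 4 measures that have not been adopted yet."""
    unknown = set(adopted) - set(STEP4_MEASURES)
    if unknown:
        raise ValueError(f"unknown measures: {sorted(unknown)}")
    return [desc for key, desc in STEP4_MEASURES.items() if key not in adopted]


if __name__ == "__main__":
    for item in outstanding_measures({"eu_region", "customer_managed_keys"}):
        print("TODO:", item)
```

Note that even a fully adopted checklist leaves the metadata exposure in place, which is why the documentation measure is listed alongside the technical ones.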

Step 5: Consider the EU-native stack for new greenfield deployments.

For organisations starting new data platform initiatives in 2026, the composable EU-native stack (Spark + Iceberg + DuckDB + MinIO + Atlas) is worth evaluating before committing to managed services. The market has matured: managed Spark in the EU regions of US hyperscalers (EMR in eu-central-1, Dataproc in EU regions) provides a middle path between full sovereignty and full managed service convenience, although the operator remains subject to US jurisdiction.

Closing: What the EU Data Lakehouse Series Revealed

This five-post series covered the four dominant commercial data lakehouse platforms. The finding is consistent across all four: the architecture of data lakehouse platforms — accumulating comprehensive metadata about EU data operations — creates a CLOUD Act exposure profile that is qualitatively different from traditional databases or storage services.

The key insight is the inversion of the conventional risk model: a CLOUD Act subpoena targeting your object storage yields raw data, which may or may not identify individuals. A CLOUD Act subpoena targeting your lakehouse platform yields the architectural intelligence of how your organisation processes EU personal data — schemas, lineage, query patterns, model metadata. This intelligence often reveals more about your data operations than the underlying data itself.

EU organisations deploying data lakehouses in 2026 have three rational responses:

Accept the risk with documentation: Deploy commercial platforms with explicit CLOUD Act exposure documented in DPIA/RoPA and residual risk accepted at DPO/Board level.
Mitigate through architecture: Deploy commercial platforms with maximum available technical isolation (EU regions, customer-managed keys, on-premises components for high-risk workloads).
Eliminate through sovereignty: Deploy the EU-native composable stack for workloads where the CLOUD Act exposure is unacceptable, accepting the engineering overhead.

All three responses can be rational depending on organisational capability, regulatory exposure, and risk tolerance. What is not rational — and what EU data protection authorities are increasingly scrutinising — is deploying US-controlled data lakehouse infrastructure over special-category personal data without documented awareness of the CLOUD Act exposure or a DPIA that addresses it.

This post closes the EU Data Lakehouse Tools series. The next series examines EU Cloud Infrastructure Providers and the sovereignty spectrum from AWS/Azure/GCP to EU-native alternatives like Hetzner, Scaleway, OVHcloud, and IONOS.

All CLOUD Act scores in this series use the same five-dimension framework published at the start of this blog. Scores reflect the platforms' standard SaaS configurations as of May 2026. On-premises or custom enterprise deployments may modify D2–D4 scores. D1 (corporate jurisdiction) and D5 (legal framework) are unaffected by deployment configuration.
