AWS Glue EU Alternative 2026: ETL Pipelines, GDPR Compliance, and CLOUD Act Risk
Post #719 in the sota.io EU Compliance Series
AWS Glue is Amazon's serverless ETL service: a managed environment for extracting data from sources, transforming it with Python or Scala, and loading it into destinations. Glue eliminates the cluster management work that came with older Hadoop-based pipelines. Teams adopt it because it handles job scheduling, script execution, and data cataloging as a unified service.
What Glue also does, by design, is accumulate metadata at every layer of your data pipeline. Job scripts that encode your business logic and data structure. A Data Catalog that maps out every table, column, and data type across your systems — including the personal data your organization holds. Crawler histories that document what personal data existed and where. Job run logs that record every transformation applied to user records. Development endpoint sessions where data scientists may have loaded production samples.
All of this lives in your AWS account, under US jurisdiction, subject to CLOUD Act compelled disclosure.
For organizations operating under GDPR, the Data Catalog alone creates a significant compliance problem: it encodes a structured map of your personal data estate. Combined with job scripts that process that data, and run histories that prove what transformations occurred, Glue creates an audit trail that functions as a comprehensive record of your personal data processing — stored with a US corporation that cannot refuse a lawful government request.
What AWS Glue Stores
AWS Glue is not a single service. It consists of several interacting components, each with distinct data retention and jurisdictional implications.
The AWS Glue Data Catalog
The Data Catalog is Glue's central metadata repository. Every database, table, column, partition, and connection definition lives here. For organizations with personal data, the catalog encodes sensitive structural information:
{
"DatabaseName": "production-users",
"Tables": [
{
"Name": "user_profiles",
"StorageDescriptor": {
"Columns": [
{"Name": "user_id", "Type": "string"},
{"Name": "email", "Type": "string"},
{"Name": "date_of_birth", "Type": "date"},
{"Name": "ip_address_last_seen", "Type": "string"},
{"Name": "gdpr_consent_timestamp", "Type": "timestamp"},
{"Name": "data_deletion_requested", "Type": "boolean"}
],
"Location": "s3://prod-data-lake/users/profiles/"
}
},
{
"Name": "user_behavioral_events",
"StorageDescriptor": {
"Columns": [
{"Name": "user_id", "Type": "string"},
{"Name": "session_id", "Type": "string"},
{"Name": "event_type", "Type": "string"},
{"Name": "page_url", "Type": "string"},
{"Name": "device_fingerprint", "Type": "string"}
],
"Location": "s3://prod-data-lake/users/events/"
}
}
]
}
The column names here — email, date_of_birth, ip_address_last_seen, gdpr_consent_timestamp, data_deletion_requested — constitute a precise inventory of personal data fields your organization maintains. Under GDPR Article 30, organizations must maintain records of processing activities. The Glue Data Catalog is effectively that record, stored under US jurisdiction.
The catalog also stores partition schemas. If your user_behavioral_events table is partitioned by dt=2026-04-30/country=DE, the catalog records not just the structure but evidence of what data segments exist. A government request does not need your underlying data if it can compel disclosure of the catalog: the catalog tells them what you have and where it is.
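To see how complete this inventory is, you can enumerate it yourself. The following is a minimal audit sketch, assuming standard boto3 credentials; the PII keyword list is an illustrative placeholder you would extend for your own schemas.
import boto3

# Substrings that suggest a column holds personal data; extend for your schemas.
PII_HINTS = {"email", "birth", "ip_address", "phone", "fingerprint", "consent"}

glue = boto3.client("glue", region_name="eu-central-1")

for db_page in glue.get_paginator("get_databases").paginate():
    for db in db_page["DatabaseList"]:
        for tbl_page in glue.get_paginator("get_tables").paginate(DatabaseName=db["Name"]):
            for table in tbl_page["TableList"]:
                columns = table.get("StorageDescriptor", {}).get("Columns", [])
                flagged = [c["Name"] for c in columns
                           if any(hint in c["Name"].lower() for hint in PII_HINTS)]
                if flagged:
                    # Every line printed here is personal-data structure AWS holds.
                    print(db["Name"], table["Name"], flagged)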
ETL Job Scripts
Glue job scripts are stored in S3, but the job definitions themselves — including the S3 path to the script, all job parameters, and the full execution configuration — are stored in Glue's managed service layer.
A typical Glue job definition exposes considerable processing logic:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'source_database', 'target_bucket'])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# Extract EU users with valid GDPR consent
users_df = glueContext.create_dynamic_frame.from_catalog(
database="production-users",
table_name="user_profiles"
)
# Filter for GDPR-consented users only
consented_users = Filter.apply(
frame=users_df,
f=lambda x: x["gdpr_consent_timestamp"] is not None
and not x["data_deletion_requested"]
)
# Drop PII fields before writing to analytics store
anonymized = DropFields.apply(
frame=consented_users,
paths=["email", "date_of_birth", "ip_address_last_seen"]
)
glueContext.write_dynamic_frame.from_options(
frame=anonymized,
connection_type="s3",
connection_options={"path": f"s3://{args['target_bucket']}/analytics/"}
)
This script reveals: the existence of a GDPR consent flag in your schema, the fact that some users have requested deletion, which fields you treat as PII, and the structure of your analytics pipeline. Under a CLOUD Act subpoena, AWS must hand this script to the requesting government — along with the metadata about every time it ran, with what parameters, on what data.
Job Run Histories and Execution Logs
Every Glue job execution creates a persistent run record:
{
"JobName": "user-gdpr-anonymization-pipeline",
"JobRunId": "jr_abc123",
"StartedOn": "2026-04-30T02:00:00Z",
"CompletedOn": "2026-04-30T02:14:22Z",
"JobRunState": "SUCCEEDED",
"Arguments": {
"--source_database": "production-users",
"--target_bucket": "analytics-prod-eu",
"--processing_date": "2026-04-29",
"--affected_users": "142857"
},
"DPUSeconds": 1240,
"ExecutionTime": 862,
"LogGroupName": "/aws-glue/jobs/output"
}
The --affected_users argument is not hypothetical. Teams routinely pass cohort sizes and processing parameters as job arguments to support monitoring and debugging. A job run history becomes a timestamped log of exactly how many user records were processed, when, and by which pipeline. Over time this accumulates into evidence of your complete data processing operations, retained in Glue's managed layer.
CloudWatch Logs integration means the full job output — including any log lines that reference user IDs, error records, or processing statistics — is stored in AWS-managed infrastructure.
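Retrieving this history takes one API call. A short sketch, assuming standard boto3 credentials and the job name from the example above:
import boto3

glue = boto3.client("glue", region_name="eu-central-1")

# Each entry returns the state, timestamps, and the full argument map,
# including any cohort sizes passed as parameters.
response = glue.get_job_runs(JobName="user-gdpr-anonymization-pipeline")
for run in response["JobRuns"]:
    print(run["Id"], run["JobRunState"], run.get("StartedOn"), run.get("Arguments", {}))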
Glue Crawlers and Crawler Histories
Glue Crawlers automatically discover data in S3 or databases and populate the Data Catalog. Their configurations store:
{
"CrawlerName": "prod-user-data-crawler",
"Targets": {
"S3Targets": [
{"Path": "s3://prod-data-lake/users/"},
{"Path": "s3://prod-data-lake/transactions/"},
{"Path": "s3://prod-data-lake/behavioral-events/"}
],
"JdbcTargets": [
{
"ConnectionName": "prod-rds-postgres",
"Path": "prod-db/public/%"
}
]
},
"Schedule": "cron(0 */6 * * ? *)",
"SchemaChangePolicy": {
"UpdateBehavior": "UPDATE_IN_DATABASE",
"DeleteBehavior": "LOG"
}
}
The crawler configuration reveals the topology of your data infrastructure: which S3 buckets contain user data, what databases are connected, how frequently data is updated. Crawler run histories record every schema change — meaning AWS holds a complete changelog of how your personal data structures evolved over time.
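Listing the crawler fleet reproduces that topology map in a few lines. A minimal sketch, again assuming standard boto3 credentials:
import boto3

glue = boto3.client("glue", region_name="eu-central-1")

for page in glue.get_paginator("get_crawlers").paginate():
    for crawler in page["Crawlers"]:
        s3_paths = [t["Path"] for t in crawler["Targets"].get("S3Targets", [])]
        jdbc_conns = [t["ConnectionName"] for t in crawler["Targets"].get("JdbcTargets", [])]
        # Name, every S3 prefix crawled, and every JDBC connection touched.
        print(crawler["Name"], s3_paths, jdbc_conns)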
Glue Connections and Credential References
Glue Connections store the metadata for every data source your pipelines connect to:
{
"ConnectionName": "prod-rds-users-db",
"ConnectionType": "JDBC",
"ConnectionProperties": {
"JDBC_CONNECTION_URL": "jdbc:postgresql://users-db.cluster-xxx.eu-central-1.rds.amazonaws.com:5432/users",
"USERNAME": "glue_etl_user",
"PASSWORD": "{{secrets:prod/glue/rds-password}}"
},
"PhysicalConnectionRequirements": {
"SubnetId": "subnet-0abc123",
"SecurityGroupIdList": ["sg-0def456"]
}
}
The connection stores the database hostname, username, VPC subnet configuration, and a reference to the secret holding the password. AWS can see the hostname of your production user database and the network topology needed to reach it. The actual password is in Secrets Manager — another AWS-managed service — but the reference chain is complete.
Glue Development Endpoints
Development endpoints are persistent clusters that data engineers use for interactive ETL development. They are a significant GDPR risk surface:
- Development work often loads production data samples for testing transformations
- Session artifacts — scripts, loaded DataFrames, intermediate outputs — may persist in the endpoint's storage
- Endpoint configurations including VPC settings, S3 paths, and security group references are stored in Glue
- CloudWatch Logs captures the full interactive session output, including any data printed during development
GDPR Article 25 (data protection by design and by default) bears directly on development endpoints: production personal data should not flow into development systems. Glue development endpoints make this governance difficult because they are seamlessly integrated with the same Data Catalog and S3 access that production jobs use.
Glue Workflows and Triggers
Glue Workflows orchestrate multiple ETL jobs in sequence. Their configurations reveal processing dependencies:
{
"WorkflowName": "gdpr-monthly-data-retention-pipeline",
"DefaultRunProperties": {
"retention_months": "24",
"deletion_batch_size": "10000"
},
"Graph": {
"Nodes": [
{"Type": "TRIGGER", "Name": "monthly-schedule"},
{"Type": "JOB", "Name": "identify-expired-user-records"},
{"Type": "JOB", "Name": "delete-user-profiles"},
{"Type": "JOB", "Name": "delete-user-behavioral-data"},
{"Type": "JOB", "Name": "update-deletion-audit-log"}
]
}
}
The workflow name gdpr-monthly-data-retention-pipeline and the job names confirm that you conduct monthly personal data deletion — a detail that may be relevant to regulatory investigations or civil litigation. Workflow run histories record every execution, including whether deletion jobs succeeded or failed.
The CLOUD Act Vector for ETL Pipelines
The CLOUD Act (18 U.S.C. § 2713) requires US-based service providers such as AWS to comply with US government demands for data in their possession, custody, or control, regardless of where that data is physically stored. The practical implication for Glue:
A US government investigative demand does not need your user data. It can compel disclosure of:
- Your Data Catalog (complete inventory of personal data fields and locations)
- Your ETL job scripts (how you process personal data)
- Your job run histories (proof of what processing occurred, when, on how many records)
- Your crawler configurations (complete map of where your data lives)
- Your workflow definitions (your data governance procedures)
The catalog plus the job scripts constitute a structural description of your entire data processing operation. For a regulator, law enforcement agency, or civil plaintiff, this metadata can be more valuable than the underlying data itself: it proves what you had, how you used it, and when.
AWS cannot refuse a lawful CLOUD Act demand. Under GDPR Article 48, a third-country judgment or administrative decision requiring disclosure of personal data is only recognized or enforceable if it is based on an international agreement, such as a mutual legal assistance treaty. A CLOUD Act demand provides no such basis, and AWS does not control whether one exists.
GDPR Article 30: Records of Processing Activities
GDPR Article 30 requires controllers to maintain records of processing activities, including:
- Categories of personal data processed
- Purposes of processing
- Categories of recipients
- Planned deletion timelines
AWS Glue's Data Catalog, job definitions, and workflow configurations together constitute a machine-readable Article 30 record — but stored with a US processor that can be compelled to disclose it to US authorities outside any legal channel that Article 48 recognizes.
Organizations using Glue for their core data pipelines are effectively maintaining their Article 30 compliance documentation inside a CLOUD Act-exposed system. This creates an ironic situation: the records you maintain to demonstrate GDPR compliance are themselves stored in a way that may violate GDPR Article 48.
AWS Glue EU Alternatives
The EU alternative landscape for managed ETL reflects the nature of Glue itself: it is a managed execution environment for data transformation logic. EU-sovereign alternatives are either self-hosted open-source tools deployed on EU infrastructure, or EU-HQ managed services.
Apache Airflow on EU Infrastructure
Apache Airflow is the most widely deployed open-source workflow orchestration platform. Self-hosted on EU infrastructure — a Kubernetes cluster on Hetzner, OVHcloud, or Scaleway — Airflow provides full control over every component:
- DAG definitions (your pipeline code) stay in your own git repository
- Task execution logs are stored in your own infrastructure
- Connections and Variables are stored in your own PostgreSQL metastore
- Run histories are in your own database
The operator model maps naturally onto Glue's job concepts: operators such as PythonOperator and SparkSubmitOperator cover the same work as Glue ETL jobs, and transfer operators handle data movement between storage systems. Astronomer's managed Airflow offering also has EU-hosted options.
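For a feel of the translation, here is a minimal sketch of the anonymization job from earlier as an Airflow DAG, assuming a recent Airflow 2.x self-hosted on EU infrastructure; anonymize_users is a placeholder for your own PySpark entry point.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def anonymize_users():
    # Placeholder: submit the Spark job shown earlier, or run the
    # transformation inline for small datasets.
    ...

with DAG(
    dag_id="user_gdpr_anonymization_pipeline",
    start_date=datetime(2026, 1, 1),
    schedule="0 2 * * *",  # nightly at 02:00, matching the run record above
    catchup=False,
):
    PythonOperator(task_id="anonymize_users", python_callable=anonymize_users)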
dbt (Data Build Tool) on EU Infrastructure
dbt is the de facto standard for SQL-based transformation pipelines. The dbt Core project is entirely open-source and runs on any infrastructure. When deployed with a self-hosted metadata store and an EU-based data warehouse (ClickHouse, PostgreSQL on Hetzner), dbt provides:
- Model definitions (SQL files) in your own git repository
- Run artifacts (manifest.json, run_results.json) stored in your own object storage
- Documentation generated locally without external service calls
- Column-level lineage tracked in your own systems
dbt's catalog functionality maps onto Glue's Data Catalog for the transformation layer. Combined with a data catalog like Apache Atlas or OpenMetadata (self-hosted), dbt provides Glue-equivalent functionality entirely within EU-sovereign infrastructure.
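dbt runs can also be driven programmatically, which is convenient when Airflow orchestrates them. A small sketch using dbt Core's Python entry point (available from dbt-core 1.5 onward); the model name is a hypothetical placeholder.
from dbt.cli.main import dbtRunner

# Invokes dbt exactly as the CLI would; all artifacts stay on local disk.
result = dbtRunner().invoke(["run", "--select", "anonymized_users"])
print("success:", result.success)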
Airbyte (Self-Hosted or EU-Hosted)
For the EL (Extract-Load) portion of ELT pipelines, Airbyte is the open-source alternative to Glue's source connectors. Self-hosted Airbyte:
- Stores connection configurations in your own PostgreSQL database
- Runs sync jobs on your own compute
- Keeps sync logs in your own infrastructure
- Has no mandatory external telemetry in the self-hosted version
Airbyte Cloud offers EU-region hosting. The EU-based managed option removes the operational overhead while keeping data within EU jurisdiction under EU-based processors.
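Self-hosted Airbyte exposes an HTTP API for orchestration. The following is a hedged sketch of triggering a sync; the host, endpoint path, and connection ID are assumptions, so check the API reference for your Airbyte version.
import requests

# Hypothetical self-hosted instance; replace host and connection ID with your own.
resp = requests.post(
    "https://airbyte.internal.example.eu/api/v1/connections/sync",
    json={"connectionId": "00000000-0000-0000-0000-000000000000"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())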
Apache Spark on EU Kubernetes
For teams that need Glue's Spark-based processing capabilities, running Apache Spark directly on Kubernetes (using the Spark Operator) on EU-based clusters provides equivalent compute power:
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
name: user-anonymization-pipeline
namespace: data-pipelines
spec:
type: Python
pythonVersion: "3"
mode: cluster
image: registry.your-eu-domain.example/spark-etl:latest
mainApplicationFile: s3a://your-eu-bucket/scripts/anonymize_users.py
sparkVersion: "3.5.0"
driver:
cores: 2
memory: "4g"
executor:
cores: 4
memory: "8g"
instances: 10
Every component — the Spark job configuration, execution logs, and output — stays within your EU Kubernetes cluster. No metadata flows to a US-jurisdiction managed service.
Prefect and Dagster (Self-Hosted)
Prefect and Dagster are modern alternatives to Airflow with improved developer experience. Both offer open-source core versions that can be self-hosted entirely on EU infrastructure:
- Prefect Server (open-source): self-hosted API, UI, and scheduler on your own Kubernetes
- Dagster Open Source: complete self-hosted deployment with its own metadata database
Both support the full Glue feature set for workflow orchestration: job scheduling, dependency graphs, retry logic, and observability. Self-hosted deployments on EU compute keep all flow definitions, run histories, and artifacts within EU jurisdiction.
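For a sense of the programming model, here is a minimal sketch of the same pipeline as a Prefect flow, assuming a self-hosted Prefect Server; the task bodies are placeholders.
from prefect import flow, task

@task(retries=3)
def extract_users():
    ...  # read from your EU-hosted source

@task
def drop_pii(users):
    ...  # remove email, date_of_birth, and similar fields

@flow(name="user-gdpr-anonymization-pipeline")
def pipeline():
    drop_pii(extract_users())

if __name__ == "__main__":
    pipeline()  # run history lands in your own metadata store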
OpenMetadata for Data Catalog
The Glue Data Catalog's schema-tracking functionality has a direct EU-sovereign alternative in OpenMetadata. Self-hosted on EU infrastructure:
- Column-level lineage tracked in your own Elasticsearch and MySQL stores
- Data quality checks stored locally
- Profiling results in your own database
- Governance workflows in your own infrastructure
OpenMetadata connects to your existing databases, data warehouses, and ETL tools to build a catalog of your data assets — without any schema or metadata leaving your EU infrastructure.
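OpenMetadata's REST API makes the catalog queryable from your own tooling. A quick sketch, where the host and token are illustrative assumptions:
import requests

resp = requests.get(
    "https://openmetadata.internal.example.eu/api/v1/tables",
    headers={"Authorization": "Bearer <your-jwt-token>"},
    params={"limit": 25},
    timeout=30,
)
resp.raise_for_status()
for table in resp.json().get("data", []):
    print(table["fullyQualifiedName"])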
Migration Path from AWS Glue
Migrating from Glue to EU-sovereign alternatives follows a natural sequence based on risk reduction priority:
Step 1: Migrate the Data Catalog (Highest Risk)
The Data Catalog is the component most sensitive to CLOUD Act compelled disclosure — it encodes your complete personal data inventory. Deploy OpenMetadata or Apache Atlas on EU infrastructure and populate it by crawling your existing data sources directly. This removes the most sensitive structural information from AWS.
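A pragmatic first move is to export the existing catalog so the self-hosted tool can ingest it. A minimal export sketch, assuming standard boto3 credentials; the output here is plain JSON that you would map to your catalog tool's import format.
import json

import boto3

glue = boto3.client("glue", region_name="eu-central-1")

export = []
for db_page in glue.get_paginator("get_databases").paginate():
    for db in db_page["DatabaseList"]:
        for tbl_page in glue.get_paginator("get_tables").paginate(DatabaseName=db["Name"]):
            export.extend(tbl_page["TableList"])

with open("glue_catalog_export.json", "w") as f:
    json.dump(export, f, default=str)  # default=str handles datetime fields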
Step 2: Migrate ETL Job Scripts to a Version-Controlled Repository
Glue job scripts should move to a git repository on EU-hosted infrastructure (Gitea on Hetzner, or an EU-region GitLab instance). This ensures your processing logic is not stored in AWS's managed layer.
Step 3: Replace Glue Jobs with Airflow DAGs or dbt Models
Rewrite Glue ETL jobs as Airflow DAGs or dbt models. Most Glue PySpark transformations have direct equivalents in Spark-on-Kubernetes. SQL-based transformations typically require minimal changes to run as dbt models.
# Glue job (before)
from awsglue.transforms import Filter, DropFields
consented = Filter.apply(frame=users_df, f=lambda x: x["gdpr_consent"])
anonymized = DropFields.apply(frame=consented, paths=["email", "dob"])
# Airflow + PySpark equivalent (after)
from pyspark.sql.functions import col

consented = users_df.filter(col("gdpr_consent").isNotNull())
anonymized = consented.drop("email", "dob")
Step 4: Replace Glue Crawlers with Direct Connection Profiling
Replace Glue Crawlers with direct database introspection in your self-hosted catalog tool. OpenMetadata connects directly to PostgreSQL, MySQL, and S3-compatible storage to extract schema information — without routing schema data through AWS.
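Direct introspection is a standard SQL query. A minimal sketch against PostgreSQL, assuming psycopg2 and a reachable EU-hosted database; the DSN is a placeholder.
import psycopg2

conn = psycopg2.connect("postgresql://readonly@users-db.internal.example.eu/users")
with conn.cursor() as cur:
    cur.execute("""
        SELECT table_name, column_name, data_type
        FROM information_schema.columns
        WHERE table_schema = 'public'
        ORDER BY table_name, ordinal_position
    """)
    for table, column, dtype in cur.fetchall():
        # Same schema information a Glue Crawler would collect,
        # without any of it leaving your infrastructure.
        print(table, column, dtype)
conn.close()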
Step 5: Migrate Development Endpoints to JupyterHub
Replace Glue Development Endpoints with a self-hosted JupyterHub on EU infrastructure. Configure JupyterHub with network policies that enforce separation between development and production data, implementing the data protection by design controls of Article 25 that Glue endpoints make structurally difficult.
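JupyterHub's configuration file is itself Python. A minimal sketch of the separation described above, assuming JupyterHub on Kubernetes with KubeSpawner; the authenticator, image, and endpoint values are placeholders.
# jupyterhub_config.py
c.JupyterHub.authenticator_class = "oauthenticator.generic.GenericOAuthenticator"
c.KubeSpawner.image = "registry.your-eu-domain.example/notebook:latest"
# Point notebooks at development data only; production credentials are never injected.
c.KubeSpawner.environment = {"S3_ENDPOINT": "https://dev-data.your-eu-domain.example"}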
Compliance Checklist for AWS Glue Users
If you continue using AWS Glue while assessing alternatives, these measures reduce but do not eliminate GDPR risk:
Immediate:
- Audit Data Catalog contents — identify all personal data field names and remove non-essential schema documentation
- Review job script parameters — remove cohort sizes, user counts, and sensitive processing details from job arguments
- Audit development endpoints — confirm no production personal data was loaded; rotate any credentials used in dev sessions
- Enable Glue Data Catalog encryption — encrypt catalog metadata at rest using AWS KMS (does not address CLOUD Act jurisdiction, but addresses unauthorized access)
Short-term:
- Implement least-privilege IAM for Glue roles — limit which catalogs and S3 paths each job can access
- Set CloudWatch Logs retention policies — reduce log retention to the minimum required for operational purposes (see the sketch after this list)
- Document Glue as a data processor in GDPR Article 30 records — include it in your Records of Processing Activities
- Map Glue's data flows for your Data Protection Officer — the catalog, job scripts, and run histories collectively constitute processing records under Article 30
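For the CloudWatch retention item above, the policy is one API call per log group. A small sketch, assuming standard boto3 credentials and a 30-day operational minimum:
import boto3

logs = boto3.client("logs", region_name="eu-central-1")

# Glue writes job output here by default; adjust retention to your own minimum.
logs.put_retention_policy(logGroupName="/aws-glue/jobs/output", retentionInDays=30)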
Architecture:
- Evaluate self-hosted alternatives for the Data Catalog specifically — this is the highest-risk component
- Consider hybrid: EU-hosted orchestration (Airflow) calling EU-sovereign compute, with Glue only for batch jobs that process genuinely non-personal data
Choosing an EU-Sovereign ETL Stack
The right combination of tools depends on your pipeline characteristics:
| Use Case | Recommended EU Alternative |
|---|---|
| SQL-based transformations | dbt Core on EU infrastructure |
| Complex Python/Spark ETL | Airflow + Spark on EU Kubernetes |
| Data ingestion/replication | Airbyte self-hosted or EU-hosted |
| Data catalog / discovery | OpenMetadata self-hosted |
| Workflow orchestration | Airflow or Dagster self-hosted |
| Interactive development | JupyterHub on EU infrastructure |
| Full replacement | Airflow + dbt + Airbyte + OpenMetadata |
The combination of Airflow (orchestration), dbt (transformation), Airbyte (ingestion), and OpenMetadata (catalog) covers the full Glue feature set with every component deployable on EU infrastructure under your control.
Deploying on EU-Sovereign Infrastructure
Each of these tools needs compute and storage to run on. Deploying the full stack on EU-native infrastructure — Hetzner, OVHcloud, or Scaleway — typically means a managed Kubernetes environment for the orchestration layer, managed PostgreSQL for the metadata stores, and managed object storage for data and artifacts.
sota.io provides managed deployment of containerized data infrastructure on EU-sovereign compute, covering the runtime layer for self-hosted ETL tools. Your Airflow instance, dbt execution environment, and Airbyte connectors can run on EU infrastructure without US-jurisdiction cloud services in the critical data path.
Conclusion
AWS Glue's serverless ETL model is operationally convenient but creates a GDPR compliance problem that is structural, not configurational. The Data Catalog, job scripts, run histories, and crawler configurations together constitute a comprehensive record of your personal data processing — stored under US jurisdiction and subject to CLOUD Act compelled disclosure without the international-agreement basis that GDPR Article 48 requires.
This creates a specific irony for GDPR Article 30 compliance: organizations use Glue to build the records of processing activities that GDPR requires, but store those records with a US processor that cannot refuse lawful US government demands.
EU-sovereign alternatives — Apache Airflow, dbt, Airbyte, OpenMetadata, and Apache Spark — are mature, production-ready tools used by organizations processing petabytes of data. They require more operational overhead than managed Glue, but that overhead can be reduced substantially by deploying on managed EU-sovereign infrastructure.
The question is not whether self-hosted tools can match Glue's functionality. They can. The question is whether the operational convenience of AWS Glue justifies the jurisdictional exposure it creates for your organization's most sensitive structural information about personal data.
This post is part of the sota.io AWS EU Alternative Series. Related posts: AWS SageMaker EU Alternative, AWS CloudFormation EU Alternative, AWS EventBridge EU Alternative.
EU-Native Hosting
Ready to move to EU-sovereign infrastructure?
sota.io is a German-hosted PaaS — no CLOUD Act exposure, no US jurisdiction, full GDPR compliance by design. Deploy your first app in minutes.