AWS Glue EU Alternative 2026: ETL Pipelines, GDPR Compliance, and CLOUD Act Risk
Post #719 in the sota.io EU Compliance Series
AWS Glue is Amazon's serverless ETL service: a managed environment for extracting data from sources, transforming it with Python or Scala, and loading it into destinations. Glue eliminates the cluster management work that came with older Hadoop-based pipelines. Teams adopt it because it handles job scheduling, script execution, and data cataloging as a unified service.
What Glue also does, by design, is accumulate metadata at every layer of your data pipeline. Job scripts that encode your business logic and data structure. A Data Catalog that maps out every table, column, and data type across your systems — including the personal data your organization holds. Crawler histories that document what personal data existed and where. Job run logs that record every transformation applied to user records. Development endpoint sessions where data scientists may have loaded production samples.
All of this lives in your AWS account, under US jurisdiction, subject to CLOUD Act compelled disclosure.
For organizations operating under GDPR, the Data Catalog alone creates a significant compliance problem: it encodes a structured map of your personal data estate. Combined with job scripts that process that data, and run histories that prove what transformations occurred, Glue creates an audit trail that functions as a comprehensive record of your personal data processing — stored with a US corporation that cannot refuse a lawful government request.
What AWS Glue Stores
AWS Glue is not a single service. It consists of several interacting components, each with distinct data retention and jurisdictional implications.
The AWS Glue Data Catalog
The Data Catalog is Glue's central metadata repository. Every database, table, column, partition, and connection definition lives here. For organizations with personal data, the catalog encodes sensitive structural information:
{
"DatabaseName": "production-users",
"Tables": [
{
"Name": "user_profiles",
"StorageDescriptor": {
"Columns": [
{"Name": "user_id", "Type": "string"},
{"Name": "email", "Type": "string"},
{"Name": "date_of_birth", "Type": "date"},
{"Name": "ip_address_last_seen", "Type": "string"},
{"Name": "gdpr_consent_timestamp", "Type": "timestamp"},
{"Name": "data_deletion_requested", "Type": "boolean"}
],
"Location": "s3://prod-data-lake/users/profiles/"
}
},
{
"Name": "user_behavioral_events",
"StorageDescriptor": {
"Columns": [
{"Name": "user_id", "Type": "string"},
{"Name": "session_id", "Type": "string"},
{"Name": "event_type", "Type": "string"},
{"Name": "page_url", "Type": "string"},
{"Name": "device_fingerprint", "Type": "string"}
],
"Location": "s3://prod-data-lake/users/events/"
}
}
]
}
The column names here — email, date_of_birth, ip_address_last_seen, gdpr_consent_timestamp, data_deletion_requested — constitute a precise inventory of personal data fields your organization maintains. Under GDPR Article 30, organizations must maintain records of processing activities. The Glue Data Catalog is effectively that record, stored under US jurisdiction.
The catalog also stores partition schemas. If your user_behavioral_events table is partitioned by dt=2026-04-30/country=DE, the catalog records not just the structure but evidence of what data segments exist. A government request does not need your underlying data if it can compel disclosure of the catalog: the catalog tells them what you have and where it is.
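To see how complete this inventory is, you can enumerate it yourself. The following is a minimal audit sketch, assuming standard boto3 credentials; the PII keyword list is an illustrative placeholder you would extend for your own schemas.
import boto3

# Substrings that suggest a column holds personal data; extend for your schemas.
PII_HINTS = {"email", "birth", "ip_address", "phone", "fingerprint", "consent"}

glue = boto3.client("glue", region_name="eu-central-1")

for db_page in glue.get_paginator("get_databases").paginate():
    for db in db_page["DatabaseList"]:
        for tbl_page in glue.get_paginator("get_tables").paginate(DatabaseName=db["Name"]):
            for table in tbl_page["TableList"]:
                columns = table.get("StorageDescriptor", {}).get("Columns", [])
                flagged = [c["Name"] for c in columns
                           if any(hint in c["Name"].lower() for hint in PII_HINTS)]
                if flagged:
                    # Every line printed here is personal-data structure AWS holds.
                    print(db["Name"], table["Name"], flagged)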
ETL Job Scripts
Glue job scripts are stored in S3, but the job definitions themselves — including the S3 path to the script, all job parameters, and the full execution configuration — are stored in Glue's managed service layer.
A typical Glue job definition exposes considerable processing logic:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'source_database', 'target_bucket'])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# Extract EU users with valid GDPR consent
users_df = glueContext.create_dynamic_frame.from_catalog(
database="production-users",
table_name="user_profiles"
)
# Filter for GDPR-consented users only
consented_users = Filter.apply(
frame=users_df,
f=lambda x: x["gdpr_consent_timestamp"] is not None
and not x["data_deletion_requested"]
)
# Drop PII fields before writing to analytics store
anonymized = DropFields.apply(
frame=consented_users,
paths=["email", "date_of_birth", "ip_address_last_seen"]
)
glueContext.write_dynamic_frame.from_options(
frame=anonymized,
connection_type="s3",
connection_options={"path": f"s3://{args['target_bucket']}/analytics/"}
)
This script reveals: the existence of a GDPR consent flag in your schema, the fact that some users have requested deletion, which fields you treat as PII, and the structure of your analytics pipeline. Under a CLOUD Act subpoena, AWS must hand this script to the requesting government — along with the metadata about every time it ran, with what parameters, on what data.
Job Run Histories and Execution Logs
Every Glue job execution creates a persistent run record:
{
"JobName": "user-gdpr-anonymization-pipeline",
"JobRunId": "jr_abc123",
"StartedOn": "2026-04-30T02:00:00Z",
"CompletedOn": "2026-04-30T02:14:22Z",
"JobRunState": "SUCCEEDED",
"Arguments": {
"--source_database": "production-users",
"--target_bucket": "analytics-prod-eu",
"--processing_date": "2026-04-29",
"--affected_users": "142857"
},
"DPUSeconds": 1240,
"ExecutionTime": 862,
"LogGroupName": "/aws-glue/jobs/output"
}
The --affected_users argument is not hypothetical. Teams routinely pass cohort sizes and processing parameters as job arguments to support monitoring and debugging. A job run history becomes a timestamped log of exactly how many user records were processed, when, and by which pipeline. Over time this accumulates into evidence of your complete data processing operations, retained in Glue's managed layer.
CloudWatch Logs integration means the full job output — including any log lines that reference user IDs, error records, or processing statistics — is stored in AWS-managed infrastructure.
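Retrieving this history takes one API call. A short sketch, assuming standard boto3 credentials and the job name from the example above:
import boto3

glue = boto3.client("glue", region_name="eu-central-1")

# Each entry returns the state, timestamps, and the full argument map,
# including any cohort sizes passed as parameters.
response = glue.get_job_runs(JobName="user-gdpr-anonymization-pipeline")
for run in response["JobRuns"]:
    print(run["Id"], run["JobRunState"], run.get("StartedOn"), run.get("Arguments", {}))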
Glue Crawlers and Crawler Histories
Glue Crawlers automatically discover data in S3 or databases and populate the Data Catalog. Their configurations store:
{
"CrawlerName": "prod-user-data-crawler",
"Targets": {
"S3Targets": [
{"Path": "s3://prod-data-lake/users/"},
{"Path": "s3://prod-data-lake/transactions/"},
{"Path": "s3://prod-data-lake/behavioral-events/"}
],
"JdbcTargets": [
{
"ConnectionName": "prod-rds-postgres",
"Path": "prod-db/public/%"
}
]
},
"Schedule": "cron(0 */6 * * ? *)",
"SchemaChangePolicy": {
"UpdateBehavior": "UPDATE_IN_DATABASE",
"DeleteBehavior": "LOG"
}
}
The crawler configuration reveals the topology of your data infrastructure: which S3 buckets contain user data, what databases are connected, how frequently data is updated. Crawler run histories record every schema change — meaning AWS holds a complete changelog of how your personal data structures evolved over time.
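Listing the crawler fleet reproduces that topology map in a few lines. A minimal sketch, again assuming standard boto3 credentials:
import boto3

glue = boto3.client("glue", region_name="eu-central-1")

for page in glue.get_paginator("get_crawlers").paginate():
    for crawler in page["Crawlers"]:
        s3_paths = [t["Path"] for t in crawler["Targets"].get("S3Targets", [])]
        jdbc_conns = [t["ConnectionName"] for t in crawler["Targets"].get("JdbcTargets", [])]
        # Name, every S3 prefix crawled, and every JDBC connection touched.
        print(crawler["Name"], s3_paths, jdbc_conns)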
Glue Connections and Credential References
Glue Connections store the metadata for every data source your pipelines connect to:
{
"ConnectionName": "prod-rds-users-db",
"ConnectionType": "JDBC",
"ConnectionProperties": {
"JDBC_CONNECTION_URL": "jdbc:postgresql://users-db.cluster-xxx.eu-central-1.rds.amazonaws.com:5432/users",
"USERNAME": "glue_etl_user",
"PASSWORD": "{{secrets:prod/glue/rds-password}}"
},
"PhysicalConnectionRequirements": {
"SubnetId": "subnet-0abc123",
"SecurityGroupIdList": ["sg-0def456"]
}
}
The connection stores the database hostname, username, VPC subnet configuration, and a reference to the secret holding the password. AWS can see the hostname of your production user database and the network topology needed to reach it. The actual password is in Secrets Manager — another AWS-managed service — but the reference chain is complete.
Glue Development Endpoints
Development endpoints are persistent clusters that data engineers use for interactive ETL development. They are a significant GDPR risk surface:
- Development work often loads production data samples for testing transformations
- Session artifacts — scripts, loaded DataFrames, intermediate outputs — may persist in the endpoint's storage
- Endpoint configurations including VPC settings, S3 paths, and security group references are stored in Glue
- CloudWatch Logs captures the full interactive session output, including any data printed during development
GDPR Article 25 (data protection by design and by default) bears directly on development endpoints: production personal data should not flow into development systems. Glue development endpoints make this governance difficult because they are seamlessly integrated with the same Data Catalog and S3 access that production jobs use.
Glue Workflows and Triggers
Glue Workflows orchestrate multiple ETL jobs in sequence. Their configurations reveal processing dependencies:
{
"WorkflowName": "gdpr-monthly-data-retention-pipeline",
"DefaultRunProperties": {
"retention_months": "24",
"deletion_batch_size": "10000"
},
"Graph": {
"Nodes": [
{"Type": "TRIGGER", "Name": "monthly-schedule"},
{"Type": "JOB", "Name": "identify-expired-user-records"},
{"Type": "JOB", "Name": "delete-user-profiles"},
{"Type": "JOB", "Name": "delete-user-behavioral-data"},
{"Type": "JOB", "Name": "update-deletion-audit-log"}
]
}
}
The workflow name gdpr-monthly-data-retention-pipeline and the job names confirm that you conduct monthly personal data deletion — a detail that may be relevant to regulatory investigations or civil litigation. Workflow run histories record every execution, including whether deletion jobs succeeded or failed.
The CLOUD Act Vector for ETL Pipelines
The CLOUD Act (18 U.S.C. § 2713) requires US-based service providers such as AWS to comply with US government demands for data in their possession, custody, or control, regardless of where that data is physically stored. The practical implication for Glue:
A US government investigative demand does not need your user data. It can compel disclosure of:
- Your Data Catalog (complete inventory of personal data fields and locations)
- Your ETL job scripts (how you process personal data)
- Your job run histories (proof of what processing occurred, when, on how many records)
- Your crawler configurations (complete map of where your data lives)
- Your workflow definitions (your data governance procedures)
The catalog plus the job scripts constitute a structural description of your entire data processing operation. For a regulator, law enforcement agency, or civil plaintiff, this metadata can be more valuable than the underlying data itself: it proves what you had, how you used it, and when.
AWS cannot refuse a lawful CLOUD Act demand. Under GDPR Article 48, a third-country judgment or administrative decision requiring disclosure of personal data is only recognized or enforceable if it is based on an international agreement, such as a mutual legal assistance treaty. A CLOUD Act demand provides no such basis, and AWS does not control whether one exists.
GDPR Article 30: Records of Processing Activities
GDPR Article 30 requires controllers to maintain records of processing activities, including:
- Categories of personal data processed
- Purposes of processing
- Categories of recipients
- Planned deletion timelines
AWS Glue's Data Catalog, job definitions, and workflow configurations together constitute a machine-readable Article 30 record — but stored with a US processor that can be compelled to disclose it to US authorities outside any legal channel that Article 48 recognizes.
Organizations using Glue for their core data pipelines are effectively maintaining their Article 30 compliance documentation inside a CLOUD Act-exposed system. This creates an ironic situation: the records you maintain to demonstrate GDPR compliance are themselves stored in a way that may violate GDPR Article 48.
AWS Glue EU Alternatives
The EU alternative landscape for managed ETL reflects the nature of Glue itself: it is a managed execution environment for data transformation logic. EU-sovereign alternatives are either self-hosted open-source tools deployed on EU infrastructure, or EU-HQ managed services.
Apache Airflow on EU Infrastructure
Apache Airflow is the most widely deployed open-source workflow orchestration platform. Self-hosted on EU infrastructure — a Kubernetes cluster on Hetzner, OVHcloud, or Scaleway — Airflow provides full control over every component:
- DAG definitions (your pipeline code) stay in your own git repository
- Task execution logs are stored in your own infrastructure
- Connections and Variables are stored in your own PostgreSQL metastore
- Run histories are in your own database
The operator model maps naturally onto Glue's job concepts: operators such as PythonOperator and SparkSubmitOperator cover the same work as Glue ETL jobs, and transfer operators handle data movement between storage systems. Astronomer's managed Airflow offering also has EU-hosted options.
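For a feel of the translation, here is a minimal sketch of the anonymization job from earlier as an Airflow DAG, assuming a recent Airflow 2.x self-hosted on EU infrastructure; anonymize_users is a placeholder for your own PySpark entry point.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def anonymize_users():
    # Placeholder: submit the Spark job shown earlier, or run the
    # transformation inline for small datasets.
    ...

with DAG(
    dag_id="user_gdpr_anonymization_pipeline",
    start_date=datetime(2026, 1, 1),
    schedule="0 2 * * *",  # nightly at 02:00, matching the run record above
    catchup=False,
):
    PythonOperator(task_id="anonymize_users", python_callable=anonymize_users)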
dbt (Data Build Tool) on EU Infrastructure
dbt is the de facto standard for SQL-based transformation pipelines. The dbt Core project is entirely open-source and runs on any infrastructure. When deployed with a self-hosted metadata store and an EU-based data warehouse (ClickHouse, PostgreSQL on Hetzner), dbt provides:
- Model definitions (SQL files) in your own git repository
- Run artifacts (manifest.json, run_results.json) stored in your own object storage
- Documentation generated locally without external service calls
- Column-level lineage tracked in your own systems
dbt's catalog functionality maps onto Glue's Data Catalog for the transformation layer. Combined with a data catalog like Apache Atlas or OpenMetadata (self-hosted), dbt provides Glue-equivalent functionality entirely within EU-sovereign infrastructure.
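dbt runs can also be driven programmatically, which is convenient when Airflow orchestrates them. A small sketch using dbt Core's Python entry point (available from dbt-core 1.5 onward); the model name is a hypothetical placeholder.
from dbt.cli.main import dbtRunner

# Invokes dbt exactly as the CLI would; all artifacts stay on local disk.
result = dbtRunner().invoke(["run", "--select", "anonymized_users"])
print("success:", result.success)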
Airbyte (Self-Hosted or EU-Hosted)
For the EL (Extract-Load) portion of ELT pipelines, Airbyte is the open-source alternative to Glue's source connectors. Self-hosted Airbyte:
- Stores connection configurations in your own PostgreSQL database
- Runs sync jobs on your own compute
- Keeps sync logs in your own infrastructure
- Has no mandatory external telemetry in the self-hosted version
Airbyte Cloud offers EU-region hosting. The EU-based managed option removes the operational overhead while keeping data within EU jurisdiction under EU-based processors.
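Self-hosted Airbyte exposes an HTTP API for orchestration. The following is a hedged sketch of triggering a sync; the host, endpoint path, and connection ID are assumptions, so check the API reference for your Airbyte version.
import requests

# Hypothetical self-hosted instance; replace host and connection ID with your own.
resp = requests.post(
    "https://airbyte.internal.example.eu/api/v1/connections/sync",
    json={"connectionId": "00000000-0000-0000-0000-000000000000"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())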
Apache Spark on EU Kubernetes
For teams that need Glue's Spark-based processing capabilities, running Apache Spark directly on Kubernetes (using the Spark Operator) on EU-based clusters provides equivalent compute power:
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
name: user-anonymization-pipeline
namespace: data-pipelines
spec:
type: Python
pythonVersion: "3"
mode: cluster
image: registry.your-eu-domain.example/spark-etl:latest
mainApplicationFile: s3a://your-eu-bucket/scripts/anonymize_users.py
sparkVersion: "3.5.0"
driver:
cores: 2
memory: "4g"
executor:
cores: 4
memory: "8g"
instances: 10
Every component — the Spark job configuration, execution logs, and output — stays within your EU Kubernetes cluster. No metadata flows to a US-jurisdiction managed service.
Prefect and Dagster (Self-Hosted)
Prefect and Dagster are modern alternatives to Airflow with improved developer experience. Both offer open-source core versions that can be self-hosted entirely on EU infrastructure:
- Prefect Server (open-source): self-hosted API, UI, and scheduler on your own Kubernetes
- Dagster Open Source: complete self-hosted deployment with its own metadata database
Both support the full Glue feature set for workflow orchestration: job scheduling, dependency graphs, retry logic, and observability. Self-hosted deployments on EU compute keep all flow definitions, run histories, and artifacts within EU jurisdiction.
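For a sense of the programming model, here is a minimal sketch of the same pipeline as a Prefect flow, assuming a self-hosted Prefect Server; the task bodies are placeholders.
from prefect import flow, task

@task(retries=3)
def extract_users():
    ...  # read from your EU-hosted source

@task
def drop_pii(users):
    ...  # remove email, date_of_birth, and similar fields

@flow(name="user-gdpr-anonymization-pipeline")
def pipeline():
    drop_pii(extract_users())

if __name__ == "__main__":
    pipeline()  # run history lands in your own metadata store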
OpenMetadata for Data Catalog
The Glue Data Catalog's schema-tracking functionality has a direct EU-sovereign alternative in OpenMetadata. Self-hosted on EU infrastructure:
- Column-level lineage tracked in your own Elasticsearch and MySQL stores
- Data quality checks stored locally
- Profiling results in your own database
- Governance workflows in your own infrastructure
OpenMetadata connects to your existing databases, data warehouses, and ETL tools to build a catalog of your data assets — without any schema or metadata leaving your EU infrastructure.
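OpenMetadata's REST API makes the catalog queryable from your own tooling. A quick sketch, where the host and token are illustrative assumptions:
import requests

resp = requests.get(
    "https://openmetadata.internal.example.eu/api/v1/tables",
    headers={"Authorization": "Bearer <your-jwt-token>"},
    params={"limit": 25},
    timeout=30,
)
resp.raise_for_status()
for table in resp.json().get("data", []):
    print(table["fullyQualifiedName"])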
Migration Path from AWS Glue
Migrating from Glue to EU-sovereign alternatives follows a natural sequence based on risk reduction priority:
Step 1: Migrate the Data Catalog (Highest Risk)
The Data Catalog is the component most sensitive to CLOUD Act compelled disclosure — it encodes your complete personal data inventory. Deploy OpenMetadata or Apache Atlas on EU infrastructure and populate it by crawling your existing data sources directly. This removes the most sensitive structural information from AWS.
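A pragmatic first move is to export the existing catalog so the self-hosted tool can ingest it. A minimal export sketch, assuming standard boto3 credentials; the output here is plain JSON that you would map to your catalog tool's import format.
import json

import boto3

glue = boto3.client("glue", region_name="eu-central-1")

export = []
for db_page in glue.get_paginator("get_databases").paginate():
    for db in db_page["DatabaseList"]:
        for tbl_page in glue.get_paginator("get_tables").paginate(DatabaseName=db["Name"]):
            export.extend(tbl_page["TableList"])

with open("glue_catalog_export.json", "w") as f:
    json.dump(export, f, default=str)  # default=str handles datetime fields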
Step 2: Migrate ETL Job Scripts to a Version-Controlled Repository
Glue job scripts should move to a git repository on EU-hosted infrastructure (Gitea on Hetzner, or an EU-region GitLab instance). This ensures your processing logic is not stored in AWS's managed layer.
Step 3: Replace Glue Jobs with Airflow DAGs or dbt Models
Rewrite Glue ETL jobs as Airflow DAGs or dbt models. Most Glue PySpark transformations have direct equivalents in Spark-on-Kubernetes. SQL-based transformations typically require minimal changes to run as dbt models.
# Glue job (before)
from awsglue.transforms import Filter, DropFields
consented = Filter.apply(frame=users_df, f=lambda x: x["gdpr_consent"])
anonymized = DropFields.apply(frame=consented, paths=["email", "dob"])
# Airflow + PySpark equivalent (after)
from pyspark.sql.functions import col

consented = users_df.filter(col("gdpr_consent").isNotNull())
anonymized = consented.drop("email", "dob")
Step 4: Replace Glue Crawlers with Direct Connection Profiling
Replace Glue Crawlers with direct database introspection in your self-hosted catalog tool. OpenMetadata connects directly to PostgreSQL, MySQL, and S3-compatible storage to extract schema information — without routing schema data through AWS.
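Direct introspection is a standard SQL query. A minimal sketch against PostgreSQL, assuming psycopg2 and a reachable EU-hosted database; the DSN is a placeholder.
import psycopg2

conn = psycopg2.connect("postgresql://readonly@users-db.internal.example.eu/users")
with conn.cursor() as cur:
    cur.execute("""
        SELECT table_name, column_name, data_type
        FROM information_schema.columns
        WHERE table_schema = 'public'
        ORDER BY table_name, ordinal_position
    """)
    for table, column, dtype in cur.fetchall():
        # Same schema information a Glue Crawler would collect,
        # without any of it leaving your infrastructure.
        print(table, column, dtype)
conn.close()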
Step 5: Migrate Development Endpoints to JupyterHub
Replace Glue Development Endpoints with a self-hosted JupyterHub on EU infrastructure. Configure JupyterHub with network policies that enforce separation between development and production data, implementing the data protection by design controls of Article 25 that Glue endpoints make structurally difficult.
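JupyterHub's configuration file is itself Python. A minimal sketch of the separation described above, assuming JupyterHub on Kubernetes with KubeSpawner; the authenticator, image, and endpoint values are placeholders.
# jupyterhub_config.py
c.JupyterHub.authenticator_class = "oauthenticator.generic.GenericOAuthenticator"
c.KubeSpawner.image = "registry.your-eu-domain.example/notebook:latest"
# Point notebooks at development data only; production credentials are never injected.
c.KubeSpawner.environment = {"S3_ENDPOINT": "https://dev-data.your-eu-domain.example"}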
Compliance Checklist for AWS Glue Users
If you continue using AWS Glue while assessing alternatives, these measures reduce but do not eliminate GDPR risk:
Immediate:
- Audit Data Catalog contents — identify all personal data field names and remove non-essential schema documentation
- Review job script parameters — remove cohort sizes, user counts, and sensitive processing details from job arguments
- Audit development endpoints — confirm no production personal data was loaded; rotate any credentials used in dev sessions
- Enable Glue Data Catalog encryption — encrypt catalog metadata at rest using AWS KMS (does not address CLOUD Act jurisdiction, but addresses unauthorized access)
Short-term:
- Implement least-privilege IAM for Glue roles — limit which catalogs and S3 paths each job can access
- Set CloudWatch Logs retention policies — reduce log retention to the minimum required for operational purposes (see the sketch after this list)
- Document Glue as a data processor in GDPR Article 30 records — include it in your Records of Processing Activities
- Map Glue's data flows for your Data Protection Officer — the catalog, job scripts, and run histories collectively constitute processing records under Article 30
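For the CloudWatch retention item above, the policy is one API call per log group. A small sketch, assuming standard boto3 credentials and a 30-day operational minimum:
import boto3

logs = boto3.client("logs", region_name="eu-central-1")

# Glue writes job output here by default; adjust retention to your own minimum.
logs.put_retention_policy(logGroupName="/aws-glue/jobs/output", retentionInDays=30)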
Architecture:
- Evaluate self-hosted alternatives for the Data Catalog specifically — this is the highest-risk component
- Consider hybrid: EU-hosted orchestration (Airflow) calling EU-sovereign compute, with Glue only for batch jobs that process genuinely non-personal data
Choosing an EU-Sovereign ETL Stack
The right combination of tools depends on your pipeline characteristics:
| Use Case | Recommended EU Alternative |
|---|---|
| SQL-based transformations | dbt Core on EU infrastructure |
| Complex Python/Spark ETL | Airflow + Spark on EU Kubernetes |
| Data ingestion/replication | Airbyte self-hosted or EU-hosted |
| Data catalog / discovery | OpenMetadata self-hosted |
| Workflow orchestration | Airflow or Dagster self-hosted |
| Interactive development | JupyterHub on EU infrastructure |
| Full replacement | Airflow + dbt + Airbyte + OpenMetadata |
The combination of Airflow (orchestration), dbt (transformation), Airbyte (ingestion), and OpenMetadata (catalog) covers the full Glue feature set with every component deployable on EU infrastructure under your control.
Deploying on EU-Sovereign Infrastructure
Each of these tools needs compute and storage to run on. Deploying the full stack on EU-native infrastructure — Hetzner, OVHcloud, or Scaleway — typically means a managed Kubernetes environment for the orchestration layer, managed PostgreSQL for the metadata stores, and managed object storage for data and artifacts.
sota.io provides managed deployment of containerized data infrastructure on EU-sovereign compute, covering the runtime layer for self-hosted ETL tools. Your Airflow instance, dbt execution environment, and Airbyte connectors can run on EU infrastructure without US-jurisdiction cloud services in the critical data path.
Conclusion
AWS Glue's serverless ETL model is operationally convenient but creates a GDPR compliance problem that is structural, not configurational. The Data Catalog, job scripts, run histories, and crawler configurations together constitute a comprehensive record of your personal data processing — stored under US jurisdiction and subject to CLOUD Act compelled disclosure without the international-agreement basis that GDPR Article 48 requires.
This creates a specific irony for GDPR Article 30 compliance: organizations use Glue to build the records of processing activities that GDPR requires, but store those records with a US processor that cannot refuse lawful US government demands.
EU-sovereign alternatives — Apache Airflow, dbt, Airbyte, OpenMetadata, and Apache Spark — are mature, production-ready tools used by organizations processing petabytes of data. They require more operational overhead than managed Glue, but that overhead can be reduced substantially by deploying on managed EU-sovereign infrastructure.
The question is not whether self-hosted tools can match Glue's functionality. They can. The question is whether the operational convenience of AWS Glue justifies the jurisdictional exposure it creates for your organization's most sensitive structural information about personal data.
This post is part of the sota.io AWS EU Alternative Series. Related posts: AWS SageMaker EU Alternative, AWS CloudFormation EU Alternative, AWS EventBridge EU Alternative.
EU-Native Hosting
Ready to move to EU-sovereign infrastructure?
sota.io is a German-hosted PaaS — no CLOUD Act exposure, no US jurisdiction, full GDPR compliance by design. Deploy your first app in minutes.