2026-05-01 · 11 min read

AWS DataBrew EU Alternative 2026: No-Code Data Preparation and the GDPR Recipe Problem

Post #752 in the sota.io EU Compliance Series

AWS Glue DataBrew is Amazon's visual, no-code data preparation service designed for data analysts and data scientists who need to clean, normalize, and transform datasets without writing code. DataBrew provides a point-and-click interface where users connect to data sources in S3, the Glue Data Catalog, Redshift, RDS, Aurora, or other databases, apply a sequence of over 250 built-in transformations organized as a visual "recipe," and run jobs that output cleaned data to S3 or other destinations. The service handles large-scale data transformation without requiring infrastructure management or Apache Spark expertise.

Organizations use DataBrew for three primary use cases: cleaning raw data ingested from operational systems before loading into data warehouses, profiling datasets to understand their statistical characteristics and identify quality issues, and applying repeatable standardized transformations to data pipelines. DataBrew's data profiling capability automatically computes column statistics (missing value rates, cardinality, value distributions, correlation matrices) and runs built-in PII detection to flag columns likely to contain personal data. These capabilities make DataBrew attractive for teams building GDPR-compliant data pipelines — the PII detection can help identify which columns require masking or pseudonymization.

The personal data implications are significant. DataBrew by design processes data that organizations are trying to clean and prepare for analysis. That data frequently contains personal information: customer records from CRM systems, employee data from HR databases, transaction records from payment systems, healthcare data from EHR systems, or behavioral data from web analytics pipelines. DataBrew is specifically positioned for scenarios where analysts who lack engineering skills need to transform data containing PII — making it a high-exposure service from a GDPR perspective.

The structural GDPR problem: DataBrew is a fully managed AWS service. Recipe definitions, dataset connection configurations, data profile reports, job run logs, and PII detection results are stored in AWS service state managed by Amazon.com, Inc., a US company subject to CLOUD Act compelled disclosure. The recipe — the documented sequence of transformations applied to personal data — is not stored in infrastructure the EU organization controls. Neither are the data profile reports that map which columns contain what categories of personal data. Under GDPR Article 30, organizations must maintain records of processing activities. The processing records that DataBrew automatically generates about how personal data is transformed are held by a US company with compelled-disclosure obligations to US government authorities.

What AWS DataBrew Actually Does

DataBrew operates on a project model. A project connects a dataset (a pointer to a data source) to a recipe (an ordered list of transformation steps). The dataset defines where source data lives — an S3 path, a Glue Data Catalog table backed by S3 Parquet files, a Redshift schema, an RDS table, or an Aurora database. The recipe contains individual transformation steps: remove duplicates, filter rows, rename columns, change data types, apply regex substitutions, mask PII values, calculate derived columns, aggregate data, join datasets, or apply custom SQL expressions.
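
Recipes, datasets, and projects are plain API objects, which is exactly how their definitions end up in AWS service state. A minimal boto3 sketch of creating a recipe follows; the recipe name, columns, and operation strings are illustrative rather than a definitive catalog of DataBrew operations:

import boto3

databrew = boto3.client("databrew", region_name="eu-central-1")

# Each step is an Action with an Operation string and string-valued
# Parameters. Operation names and parameter encodings are illustrative;
# check an exported recipe for the exact strings DataBrew uses.
databrew.create_recipe(
    Name="customer-cleanup",  # hypothetical recipe name
    Steps=[
        {"Action": {"Operation": "REMOVE_DUPLICATES",
                    "Parameters": {"sourceColumns": '["customer_id"]'}}},
        {"Action": {"Operation": "DELETE",
                    "Parameters": {"sourceColumns": '["national_id"]'}}},
    ],
)

# The authoritative copy of these steps now lives in DataBrew service
# state and is retrievable by anyone with API access via describe_recipe.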

DataBrew jobs execute recipes against datasets at scale. A job reads the source data (which may contain terabytes of personal data), applies the recipe steps, and writes transformed output to a specified S3 destination. Job run details — input path, output path, row counts, error counts, transformation step results — are logged to CloudWatch Logs and available through the DataBrew API.

The data profiling capability runs a separate analysis job. A profile job reads the dataset and computes statistics for each column: minimum, maximum, mean, standard deviation, quartile distribution, most frequent values, distinct value count, missing value rate, and PII detection results. DataBrew's PII detection uses pattern matching and machine learning to classify columns as likely to contain names, email addresses, phone numbers, postal addresses, national ID numbers, credit card numbers, IP addresses, dates of birth, or other sensitive data types. Profile results are stored as a JSON report in a specified S3 destination and are also available through the DataBrew API and console.
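
Because the profile report lands in S3 as JSON, pulling the PII findings takes only a few lines of code, a reminder that this intelligence is ordinary machine-readable data. A sketch assuming a hypothetical bucket and key; the report's field names are illustrative, so inspect an actual report for the exact schema AWS emits:

import json

import boto3

s3 = boto3.client("s3", region_name="eu-central-1")
obj = s3.get_object(Bucket="my-profile-output",        # hypothetical bucket
                    Key="reports/customers-profile.json")
report = json.loads(obj["Body"].read())

# Walk column-level results and print anything the profiler tagged as PII.
# "columns" / "entityTypes" are assumed key names, not the documented schema.
for col in report.get("columns", []):
    pii_tags = col.get("entityTypes", [])
    if pii_tags:
        print(f"{col.get('name')}: {pii_tags}")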

Six GDPR Exposure Points Specific to AWS DataBrew

1. Data Profile Reports as GDPR Art.30 PII Intelligence Under US Jurisdiction

DataBrew data profiling generates a comprehensive intelligence report about personal data in the dataset. The profile report documents: which columns were detected as containing names, email addresses, national ID numbers, phone numbers, postal codes, dates of birth, IP addresses, or financial account numbers; value distribution statistics that can reveal the geographic or demographic composition of the dataset; correlation matrices that can reveal relationships between sensitive attributes; and cardinality metrics that indicate whether columns are effectively pseudonymous.

For EU organizations, this profile report is the technical documentation of what personal data exists and where it is located within the dataset — directly relevant to GDPR Article 30 Records of Processing Activities. Under GDPR Article 35, data protection impact assessments require a systematic description of the categories of personal data being processed. DataBrew's profile report is precisely this systematic description, automatically generated and stored in two locations: an S3 bucket (which the organization controls) and AWS DataBrew service state (which Amazon.com controls under US jurisdiction). The PII detection results — a structured map of which columns contain which categories of personal data, including Art.9 special categories — are directly accessible to Amazon.com and subject to CLOUD Act compelled disclosure.

2. Recipe Definitions as Art.25 Privacy-by-Design Records Under AWS Control

GDPR Article 25 requires Data Protection by Design: implementing technical measures to ensure processing is carried out in accordance with GDPR principles. For data engineering teams, this means documenting how PII is masked, pseudonymized, aggregated, or removed during data transformation. DataBrew recipes are precisely this documentation — each recipe step records a privacy-by-design decision: mask column email with SHA-256 hash, remove column national_id, replace column full_name with [REDACTED], filter out rows where age < 18.

These recipe definitions are stored in AWS DataBrew service state under Amazon.com's management. They are visible in the DataBrew console, retrievable via the DataBrew API, and subject to CLOUD Act compelled disclosure. This means the documented record of an EU organization's privacy engineering decisions — which data is masked, how, and in which pipeline — is held by a US company with compelled-disclosure obligations to US government authorities.

A DPA (Data Protection Authority) examining whether GDPR Art.25 privacy-by-design requirements were implemented would need to review exactly this documentation. The organization can export recipe JSON from DataBrew, but the authoritative record remains in AWS service state. Under a CLOUD Act order, Amazon could be compelled to disclose recipe definitions revealing what privacy controls are (or are not) in place.

3. Job Run Logs as GDPR Art.30 Processing Records

DataBrew job runs generate logs documenting transformation execution. Each job run log includes: the input dataset path (the S3 URI or Glue table containing personal data), the output path (where transformed data was written), the number of rows read, the number of rows written, error counts, recipe steps applied, and execution timestamps. These logs are written to CloudWatch Logs and retained in the AWS DataBrew service.
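
The same run metadata is retrievable through the DataBrew API, which is precisely why it also sits in AWS service state. A minimal boto3 sketch with a hypothetical job name:

import boto3

databrew = boto3.client("databrew", region_name="eu-central-1")

# Each entry documents one execution of a transformation pipeline over
# personal data: run id, state, and timestamps held in AWS service state.
runs = databrew.list_job_runs(Name="customer-cleanup-job")
for run in runs["JobRuns"]:
    print(run.get("RunId"), run.get("State"), run.get("StartedOn"))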

Under GDPR Article 30, organizations must maintain records of processing activities including the categories of personal data processed, the recipients of personal data, transfers to third countries, and where possible the envisaged time limits for erasure. DataBrew job run logs constitute Article 30 processing records: they document when personal data was processed, by what transformation pipeline, from which source, to which destination. These Article 30 records are stored in CloudWatch Logs under Amazon.com's management, subject to CLOUD Act compelled disclosure.

For organizations subject to GDPR accountability requirements, this means the mandatory processing records documentation is partially held by a US company that can be compelled to disclose it to US government authorities — with no obligation to notify the EU organization before disclosure.

4. PII Detection Results as Art.9 Special Category Inference Intelligence

DataBrew's built-in PII detection goes beyond identifying obvious PII columns. For healthcare datasets, the profiler can identify columns likely to contain medical record numbers, diagnosis codes, prescription data, or clinical notes. For HR datasets, it can identify salary grades, performance ratings, or disciplinary action codes. The detection system applies heuristics based on column names, value patterns, and statistical characteristics.

The detection results create a structured classification of an EU organization's data assets: which datasets contain which categories of personal data, with what prevalence and distribution. Under GDPR Article 9, processing of special categories of personal data (health data, genetic data, biometric data, data revealing racial or ethnic origin, political opinions, religious beliefs, sexual orientation) requires explicit legal basis. DataBrew's PII detection results effectively produce an Art.9 data map — documentation of where special category data exists across the organization's data assets.

This Art.9 data map is stored in DataBrew profile reports accessible to Amazon.com. For a US government authority issuing a CLOUD Act order seeking to understand what sensitive data an organization processes, DataBrew profile reports would provide a structured answer: column diagnosis_code contains medical data in 98% of rows, column ethnic_origin contains racial data, column religious_affiliation contains beliefs data.

5. Dataset Connection Configurations as Art.46 Transfer Architecture Maps

DataBrew datasets define connections to data sources. A dataset configuration specifies: the S3 path or prefix containing source data (revealing internal data storage architecture), the Glue Data Catalog database and table (revealing the data catalog structure), the Redshift cluster hostname, database, and schema (revealing database infrastructure), or the RDS/Aurora endpoint and table (revealing operational database architecture). These connection configurations document where personal data lives across the organization's infrastructure.

Under GDPR Article 46, transfers of personal data to third countries require appropriate safeguards. DataBrew dataset configurations that reference S3 paths containing personal data, Redshift tables containing customer records, or RDS tables containing employee data create a documented map of where personal data is stored that is held in AWS DataBrew service state under US jurisdiction. The architecture map of an EU organization's personal data storage — including internal hostnames, schema names, and data paths — is accessible to Amazon.com.

Attackers who gain access to DataBrew service state (through a compromised AWS account, an insider threat, or a compelled disclosure) can use dataset configurations to reconstruct the complete architecture of where personal data is stored, without ever accessing the data itself.

6. Custom Transformation Expressions as Art.28 Processor Audit Evidence

DataBrew recipes can include custom SQL expressions and formula-based transformations. These custom transformations become documentation of how personal data is processed: CASE WHEN age < 18 THEN 'minor' ELSE 'adult' END, REGEXP_REPLACE(email, '@.*', '@[REDACTED]'), IF(health_condition IS NOT NULL, '[PHI_REDACTED]', health_condition). Each expression documents a processing decision about personal data.

Under GDPR Article 28, data processors must provide sufficient guarantees about technical measures for processing personal data on behalf of controllers. For EU organizations using DataBrew to process personal data on behalf of clients (acting as processors), the custom transformation expressions in DataBrew recipes constitute evidence of processing operations that would be examined in Art.28 audits. This evidence is stored in AWS service state, not in infrastructure the processor controls.

More practically: if a data protection authority requests documentation of how a specific category of personal data was handled in a data pipeline, the authoritative documentation is in DataBrew recipe state stored by Amazon.com. The organization can export recipe JSON, but the audit trail — which version of a recipe was active when, who modified it — remains in AWS service state.

EU-Native Data Preparation Alternatives

dbt (data build tool) — EU-Deployable Transformation-as-Code

Best for: SQL-based data transformation teams wanting version control, testing, and documentation

dbt is an open-source data transformation framework that runs SQL-based transformations on existing data warehouses. Instead of point-and-click recipes, dbt transformations are SQL files stored in Git repositories — giving organizations full control over transformation code, versioning, and audit history.

For GDPR compliance, dbt's model layer is critical: transformation logic is defined in SQL files the organization stores in its own Git repository, not in vendor service state. The transformation code, documentation, and lineage graphs are generated and stored on EU infrastructure. dbt Cloud (the hosted version) offers EU-hosted deployment regions; dbt Core runs entirely on organization-controlled infrastructure.

-- models/staging/stg_customers.sql
-- GDPR: mask PII at ingestion layer, store downstream as pseudonymized
SELECT
    {{ dbt_utils.generate_surrogate_key(['customer_id']) }} AS customer_key,
    -- Email masked to domain-only for analytics
    REGEXP_REPLACE(email, '^[^@]+', '[masked]') AS email_domain,
    -- Name replaced with generated identifier
    'Customer_' || customer_id::text AS display_name,
    -- Non-PII attributes preserved
    country_code,
    created_at,
    subscription_tier
FROM {{ source('raw', 'customers') }}

# models/staging/schema.yml — dbt GDPR documentation
version: 2
models:
  - name: stg_customers
    description: "Customer staging model with PII masked per GDPR Art.25"
    columns:
      - name: customer_key
        description: "GDPR Art.5(1)(e) pseudonym — surrogate key replacing customer_id"
      - name: email_domain
        description: "Domain-only email for analytics — personal part masked"
      - name: display_name
        description: "Non-identifying display name — original name not retained"

dbt integrates with data warehouses that can run on EU infrastructure: ClickHouse, PostgreSQL, DuckDB, or Databricks deployed in EU regions. Transformation logic stays in Git. Audit trails are Git commit history. No AWS service state involved.

Apache Spark + Great Expectations — EU-Deployable Quality + Transformation

Best for: Large-scale data pipelines requiring both transformation and data quality validation

Great Expectations is an open-source data quality framework that can replace DataBrew's data profiling capability. It generates detailed data documentation (called "Data Docs") including column statistics, distribution histograms, value samples, and quality validation results. Critically, Great Expectations runs on infrastructure the organization controls and stores results in configured storage — S3 on EU infrastructure, local filesystem, or PostgreSQL.

import great_expectations as gx

context = gx.get_context()

# Add S3 datasource on EU infrastructure
datasource = context.sources.add_spark_s3(
    name="eu_customer_data",
    bucket="my-eu-bucket",
    boto3_options={"endpoint_url": "https://s3.eu-central-1.amazonaws.com"}
)

# Define expectations equivalent to DataBrew PII detection
suite = context.add_expectation_suite("customer_data_gdpr_suite")

# Validate that PII columns exist and have expected format
validator = context.get_validator(
    batch_request=...,
    expectation_suite_name="customer_data_gdpr_suite"
)

# Column-level validations for GDPR Art.25 compliance documentation
validator.expect_column_values_to_match_regex(
    column="email",
    regex=r"^\[masked\]@",
    meta={"gdpr": "Art.25 — email masked at ingestion, only domain retained"}
)
validator.expect_column_to_exist("national_id_hash",
    meta={"gdpr": "Art.5(1)(e) pseudonymization — original national_id not stored"})
validator.expect_column_values_to_be_null("national_id_original",
    meta={"gdpr": "Art.17 erasure — original national_id removed after pseudonymization"})

validator.save_expectation_suite()

# Run validation — results stored to EU infrastructure, not AWS service state
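# (Assumes a checkpoint named "gdpr_compliance_checkpoint" was registered
#  earlier, e.g. via context.add_checkpoint; not shown here.)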
checkpoint_result = context.run_checkpoint(checkpoint_name="gdpr_compliance_checkpoint")

Great Expectations stores all validation results and data documentation in organization-controlled storage. Profile reports, expectation suites, and validation results never leave the organization's infrastructure. Integration with Apache Spark allows processing at DataBrew scale.

OpenRefine — Open-Source Visual Data Cleaning

Best for: Analysts needing visual, no-code data cleaning without AWS dependency

OpenRefine is a standalone open-source application for data cleaning that provides a visual interface similar to DataBrew. OpenRefine runs locally or on EU infrastructure — there is no cloud component, no service state in vendor systems, and no data leaving the organization's control.

OpenRefine's key GDPR advantage over DataBrew: the transformation history (OpenRefine's "operation history") is exportable as JSON from the application itself, not held in vendor service state. The complete record of what transformations were applied to data can be stored in organization-controlled infrastructure, committed to Git, and included in GDPR Art.30 documentation without relying on vendor API access.
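
A short script can pull that operation history out of a running instance and commit it alongside the rest of the Art.30 documentation. A sketch with a placeholder project ID; the get-operations command is part of OpenRefine's command API (the same history shown under Undo/Redo > Extract), so verify the endpoint against your OpenRefine version:

import json

import requests

OPENREFINE = "http://localhost:3333"   # EU-deployed instance, see below
PROJECT_ID = "1234567890123"           # placeholder project id

# Fetch the project's operation history as JSON.
resp = requests.get(f"{OPENREFINE}/command/core/get-operations",
                    params={"project": PROJECT_ID}, timeout=30)
resp.raise_for_status()

# Commit this file to Git alongside the rest of the Art.30 documentation.
with open("openrefine-operations.json", "w", encoding="utf-8") as f:
    json.dump(resp.json(), f, indent=2)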

OpenRefine can be deployed as a containerized service on EU infrastructure:

# docker-compose.yml — OpenRefine on EU infrastructure
version: '3.8'
services:
  openrefine:
    image: felixlohmeier/openrefine:3.7.9
    ports:
      - "3333:3333"
    volumes:
      - ./openrefine-data:/data
    # The image entrypoint is the refine launcher; pass flags as the command:
    # bind to all interfaces, serve on 3333, keep project data on the EU volume
    command: ["-i", "0.0.0.0", "-p", "3333", "-d", "/data"]
    # Deploy on EU VPS — all data stays on EU infrastructure

OpenRefine's clustering algorithms (key collision, nearest neighbor) are particularly effective for deduplicating customer name variants, address normalization, and entity resolution — common data preparation tasks on personal data.

DuckDB + Python — Embedded Analytics for EU Data Preparation

Best for: Data engineers wanting SQL-based transformation without managed service dependency

DuckDB is an embedded analytical database that processes data entirely within the running process. There is no server, no cloud service, and no vendor service state. DuckDB can read S3 Parquet files directly, apply complex SQL transformations, and write output without any data leaving the processing environment.

import os

import duckdb

# Process S3 data on EU infrastructure without managed service
conn = duckdb.connect()

# The httpfs extension provides DuckDB's S3 support
conn.execute("INSTALL httpfs; LOAD httpfs;")

# Configure S3 access to EU bucket; credentials come from the environment,
# not the code (SET does not accept prepared-statement parameters)
access_key = os.environ["AWS_ACCESS_KEY_ID"]
secret_key = os.environ["AWS_SECRET_ACCESS_KEY"]
conn.execute(f"""
    SET s3_region = 'eu-central-1';
    SET s3_endpoint = 's3.eu-central-1.amazonaws.com';
    SET s3_access_key_id = '{access_key}';
    SET s3_secret_access_key = '{secret_key}';
""")

# GDPR-compliant transformation — runs entirely in process, no service state
conn.execute("""
    COPY (
        SELECT
            -- Art.4(5) pseudonymization
            md5(customer_id::text) as customer_key,
            -- Art.25 data minimization — retain only necessary fields
            country_code,
            subscription_tier,
            date_trunc('month', created_at) as signup_month,
            -- PII removed
            -- email: removed (no legitimate analytics purpose)
            -- name: removed (no legitimate analytics purpose)
            -- national_id: removed (Art.9 sensitive)
        FROM read_parquet('s3://eu-data-bucket/raw/customers/*.parquet')
        WHERE gdpr_consent = true  -- Art.6 lawful basis filter
    ) TO 's3://eu-data-bucket/processed/customers_pseudonymized.parquet'
    (FORMAT PARQUET, COMPRESSION ZSTD)
""")

DuckDB's transformation logic lives in Python files stored in Git. There is no service state, no vendor API, no PII detection report stored outside the organization. The complete transformation pipeline is auditable from source code. DuckDB can process multi-gigabyte datasets in memory on a single EU server without requiring Spark cluster infrastructure.

Apache NiFi — EU-Deployable Data Flow Automation

Best for: Organizations needing DataBrew-scale automation with EU-deployable infrastructure

Apache NiFi is an enterprise data flow automation platform that can replace DataBrew for automated data transformation pipelines. NiFi provides a visual drag-and-drop interface for building data transformation flows — similar to DataBrew's project interface — but runs entirely on organization-controlled EU infrastructure.

NiFi handles the full DataBrew use case: read from S3 or databases, apply transformation processors (filter rows, rename fields, apply regex, mask PII, compute hash functions, join datasets), write to output destinations. NiFi flow definitions are exportable files the organization controls (XML templates in older releases, JSON flow definitions in current ones, typically versioned through NiFi Registry) and can be deployed from Git rather than living in vendor service state.

For GDPR Art.30 compliance, NiFi's provenance tracking records which processors handled which FlowFiles — a detailed processing record stored in NiFi's provenance repository on EU infrastructure.
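
For illustration, provenance events can also be pulled over NiFi's REST API for inclusion in Art.30 records. A rough sketch only: the asynchronous submit-then-fetch flow is how the provenance endpoint works, but the payload shape and field names below are assumptions to verify against the REST API documentation for your NiFi version (authentication omitted):

import requests

NIFI = "https://nifi.internal.example:8443/nifi-api"  # placeholder host

# Submit a provenance query (runs asynchronously inside NiFi).
query = {"provenance": {"request": {"maxResults": 100}}}
submitted = requests.post(f"{NIFI}/provenance", json=query, timeout=30).json()
query_id = submitted["provenance"]["id"]

# Fetch results by query id; each event records which processor touched
# which FlowFile, i.e. the processing record discussed above.
result = requests.get(f"{NIFI}/provenance/{query_id}", timeout=30).json()
for event in result["provenance"]["results"]["provenanceEvents"]:
    print(event.get("eventType"), event.get("componentName"))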

Choosing a DataBrew Alternative

Scenario → Recommended alternative
SQL-based transformation team → dbt Core on EU infrastructure
Data quality profiling and validation → Great Expectations on EU infrastructure
Visual no-code interface for analysts → OpenRefine, EU-deployed
Embedded analytics, no managed service → DuckDB + Python
Enterprise pipeline automation → Apache NiFi, EU-deployed
Multi-source data integration → Apache Airflow + dbt on EU infrastructure

The common thread: each alternative stores transformation definitions, profile reports, and processing records in organization-controlled infrastructure. The Art.30 documentation, Art.25 privacy-by-design records, and Art.9 data maps that DataBrew automatically generates in AWS service state are instead generated on EU infrastructure where only the organization and its designated EU-jurisdiction processors have access.

Migrating from DataBrew to EU Infrastructure

Step 1: Export DataBrew recipe JSON. DataBrew recipes can be exported via the AWS CLI:

aws databrew describe-recipe --name my-customer-data-recipe --output json > recipe-export.json

The exported JSON documents every transformation step. This is the starting point for converting to dbt SQL models or Great Expectations expectation suites.

Step 2: Map DataBrew transformations to dbt SQL. DataBrew's built-in transformations have direct SQL equivalents. Common mappings (a rough conversion sketch follows the list):

DataBrew: REMOVE_DUPLICATES → dbt: {{ dbt_utils.deduplicate(...) }}
DataBrew: RENAME_COLUMN → dbt: SELECT old_name AS new_name
DataBrew: CHANGE_DATA_TYPE → dbt: SELECT CAST(col AS INTEGER)
DataBrew: REPLACE_WITH_REGEX → dbt: SELECT REGEXP_REPLACE(col, pattern, replacement)
DataBrew: MASK_VALUE → dbt: SELECT MD5(col) or '[REDACTED]'
DataBrew: FILTER_ROWS → dbt: WHERE condition
DataBrew: FILL_MISSING_VALUES → dbt: SELECT COALESCE(col, default_value)
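
As a rough aid for Step 2, a script can walk the exported recipe JSON and emit SQL fragments from a mapping table like the one above. A sketch: the "Steps"/"Action" structure mirrors the DataBrew API shape, but the operation strings and parameter keys are illustrative, so align them with your actual export:

import json

with open("recipe-export.json") as f:
    recipe = json.load(f)

# Partial mapping from DataBrew operations to SQL fragments; extend per
# the table above. Parameter keys are illustrative.
SQL_TEMPLATES = {
    "RENAME_COLUMN": "{sourceColumn} AS {targetColumn}",
    "MASK_VALUE": "MD5({sourceColumn}) AS {sourceColumn}",
    "FILL_MISSING_VALUES": "COALESCE({sourceColumn}, '{value}') AS {sourceColumn}",
}

for step in recipe.get("Steps", []):
    op = step["Action"]["Operation"]
    params = step["Action"].get("Parameters", {})
    template = SQL_TEMPLATES.get(op)
    if template:
        print(template.format(**params) + ",")
    else:
        print(f"-- TODO: translate {op} manually")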

Step 3: Replicate data profiling with Great Expectations. Export DataBrew's profile report structure and recreate it as Great Expectations expectation suites. Great Expectations generates equivalent Data Docs (HTML documentation of data quality) stored on EU infrastructure.

Step 4: Schedule with Apache Airflow or cron. DataBrew's job scheduling is replaced by Apache Airflow DAGs (EU-deployable) or simple cron-scheduled Python scripts invoking dbt or Great Expectations.
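
A minimal Airflow DAG standing in for DataBrew's job scheduler might look like this sketch (Airflow 2.4+; paths, schedule, and project layout are placeholders):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Nightly pipeline: dbt transformation followed by a Great Expectations
# validation, all running on EU-hosted Airflow workers.
with DAG(
    dag_id="gdpr_data_prep",
    start_date=datetime(2026, 1, 1),
    schedule="0 2 * * *",  # nightly, like a DataBrew job schedule
    catchup=False,
) as dag:
    transform = BashOperator(
        task_id="dbt_run",
        bash_command="cd /opt/dbt/project && dbt run --select staging",
    )
    validate = BashOperator(
        task_id="gx_checkpoint",
        bash_command="cd /opt/gx/project && python run_checkpoint.py",
    )
    transform >> validate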

The Compliance Decision

DataBrew's value proposition — no-code data transformation accessible to non-engineers — is real. Point-and-click recipe building is faster than writing SQL for many transformation tasks. But the GDPR cost is the loss of control over processing records: the Art.30 documentation, the Art.25 privacy-by-design evidence, and the Art.9 data maps are generated in AWS service state, not in infrastructure the EU organization controls.

The alternatives preserve the workflow: dbt provides version-controlled transformation code with documentation and testing. Great Expectations provides equivalent data profiling with results stored on EU infrastructure. OpenRefine provides a visual interface for analysts. The transformation logic, processing records, and PII detection results stay in Git repositories and databases on EU infrastructure — not in Amazon.com's managed service state subject to CLOUD Act compelled disclosure.

For EU organizations processing personal data in data preparation pipelines, the Art.30 accountability requirement makes this straightforward: processing records must be in infrastructure where the data controller — not a US cloud provider — maintains authoritative control.


sota.io provides EU-native infrastructure for deploying data preparation pipelines (dbt, Great Expectations, Apache NiFi, Apache Airflow) entirely within EU jurisdiction. Explore EU-native data infrastructure →

EU-Native Hosting

Ready to move to EU-sovereign infrastructure?

sota.io is a German-hosted PaaS — no CLOUD Act exposure, no US jurisdiction, full GDPR compliance by design. Deploy your first app in minutes.