2026-04-30 · 14 min read

AWS SageMaker EU Alternative 2026: ML Training Data, GDPR Compliance, and CLOUD Act Risk

Post #718 in the sota.io EU Compliance Series

AWS SageMaker is the managed machine learning platform that most AWS-native teams reach for first: it handles data preparation, model training, experiment tracking, feature management, and model deployment as an integrated service. The appeal is real — SageMaker eliminates substantial infrastructure work that would otherwise fall on ML engineering teams.

What SageMaker also does, by necessity, is accumulate every artifact your ML pipeline produces. Training datasets. Model weights. Experiment logs with hyperparameters and metrics. Feature store tables built from your user data. Notebook code containing exploratory data analysis. Endpoint configurations revealing your serving architecture. All of this accumulates in your AWS account, under US jurisdiction, subject to CLOUD Act compelled disclosure.

For ML teams processing personal data — which describes a very large fraction of production ML systems — this creates a GDPR exposure that is qualitatively different from a database or an object store. The European Data Protection Board has issued guidance making clear that models trained on personal data cannot automatically be treated as anonymous. The exposure is not just the training data. It is the models, the features, the experiments, and the infrastructure that processes them.

What AWS SageMaker Stores

SageMaker is not a single service. It is an integrated platform with distinct storage implications at each layer.

Training Data and Processing Jobs

SageMaker training jobs pull data from S3. While the data itself lives in S3 (a separate risk surface), SageMaker maintains the metadata for every processing job: input data locations, transformation code, output locations, container configurations, and the full execution log.

{
  "ProcessingJobName": "user-behavior-feature-extraction-2026-04-15",
  "ProcessingInputs": [
    {
      "InputName": "user-events",
      "S3Input": {
        "S3Uri": "s3://prod-data-lake/users/events/2026/",
        "LocalPath": "/opt/ml/processing/input",
        "S3DataType": "S3Prefix"
      }
    }
  ],
  "ProcessingOutputConfig": {
    "Outputs": [
      {
        "OutputName": "processed-features",
        "S3Output": {
          "S3Uri": "s3://prod-features/user-behavior-v3/",
          "LocalPath": "/opt/ml/processing/output"
        }
      }
    ]
  },
  "Environment": {
    "FEATURE_VERSION": "v3",
    "USER_COHORT": "eu-gdpr-consented",
    "INCLUDE_PII_FIELDS": "false"
  }
}

The environment variables in this example reveal exactly what you are trying to avoid: the fact that you have EU GDPR-consented users as a distinct cohort, the feature version, and the negative confirmation that PII fields were excluded (implying they exist in the upstream data). SageMaker stores this metadata indefinitely under your account.
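Auditing that metadata is straightforward to script. The sketch below is a minimal, illustrative approach: a pure helper that flags environment variables whose names or values suggest personal-data processing, with the (real) boto3 calls that would feed it shown in comments. The pattern list and function name are assumptions, not an AWS API.

```python
import re

# Keywords suggesting a processing job touches personal data — an
# illustrative list; tune it to your own naming conventions.
SENSITIVE_PATTERNS = [r"PII", r"GDPR", r"COHORT", r"CONSENT"]

def flag_sensitive_env(env: dict) -> list[str]:
    """Return environment variable names whose key or value matches a
    sensitive pattern. Pure helper, so it can be unit-tested offline."""
    flagged = []
    for key, value in env.items():
        text = f"{key}={value}"
        if any(re.search(p, text, re.IGNORECASE) for p in SENSITIVE_PATTERNS):
            flagged.append(key)
    return flagged

# Against a live account, you would feed it describe_processing_job output:
#
#   import boto3
#   sm = boto3.client("sagemaker", region_name="eu-central-1")
#   for job in sm.list_processing_jobs(MaxResults=100)["ProcessingJobSummaries"]:
#       desc = sm.describe_processing_job(ProcessingJobName=job["ProcessingJobName"])
#       print(job["ProcessingJobName"], flag_sensitive_env(desc.get("Environment", {})))

example_env = {
    "FEATURE_VERSION": "v3",
    "USER_COHORT": "eu-gdpr-consented",
    "INCLUDE_PII_FIELDS": "false",
}
print(flag_sensitive_env(example_env))  # → ['USER_COHORT', 'INCLUDE_PII_FIELDS']
```

Running this across all historical processing jobs gives a first inventory of how much GDPR-relevant metadata has already accumulated.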

SageMaker Experiments and Experiment Tracking

SageMaker Experiments is the MLflow-equivalent built into the platform. Every training run logs:

# What SageMaker Experiments stores per training run
experiment_run = {
    "RunName": "recommendation-model-v12-eu-cohort",
    "Parameters": {
        "learning_rate": 0.0003,
        "batch_size": 512,
        "num_epochs": 50,
        "architecture": "two-tower-retrieval",
        "training_users": 847293,  # number of users in training set
        "evaluation_cohort": "eu-gdpr-consented-active-30d",
    },
    "Metrics": {
        "train_loss": [2.34, 1.87, 1.43, ...],
        "val_ndcg_at_10": [0.23, 0.31, 0.38, ...],
        "precision_at_5_eu_cohort": 0.44,
    },
    "Artifacts": {
        "model_output": "s3://prod-models/recommendation/v12/model.tar.gz",
        "training_data": "s3://prod-data-lake/users/interactions/2026-03/",
    }
}

The training_users: 847293 parameter reveals the size of your personal data training set. The cohort name eu-gdpr-consented-active-30d reveals your data segmentation strategy. The precision metric precision_at_5_eu_cohort reveals that you evaluate recommendation quality specifically on EU users. None of this is what most engineers think of as "personal data" — but under GDPR's broad definition, experiment metadata describing how you process EU users' behavioral data creates a processing record that must be handled accordingly.
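One mitigation is data minimization at the logging boundary: scrub parameters before they ever reach the tracking backend. The shim below is a sketch under the assumption that you control your training scripts' logging calls; `minimize_params` is a hypothetical helper, not a SageMaker feature. It replaces descriptive cohort names with opaque handles and buckets exact population counts.

```python
import hashlib

def minimize_params(params: dict) -> dict:
    """Strip GDPR-revealing detail from experiment parameters before
    logging. Cohort names become stable opaque handles; exact user
    counts are bucketed to an order-of-magnitude upper bound."""
    cleaned = {}
    for key, value in params.items():
        if "cohort" in key.lower():
            # Stable but non-descriptive handle for the cohort
            cleaned[key] = "cohort-" + hashlib.sha256(str(value).encode()).hexdigest()[:8]
        elif key == "training_users":
            # "<1000000" instead of the exact training-set size
            cleaned[key] = f"<{10 ** len(str(value))}"
        else:
            cleaned[key] = value
    return cleaned

params = {
    "learning_rate": 0.0003,
    "training_users": 847293,
    "evaluation_cohort": "eu-gdpr-consented-active-30d",
}
print(minimize_params(params))
```

The mapping from opaque handle back to cohort definition can then live in an EU-hosted system under access control, rather than in the US-jurisdiction experiment store.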

SageMaker Feature Store

SageMaker Feature Store is designed to store pre-computed features for ML training and inference — tabular data derived from your raw inputs, stored for reuse across models and teams. Features are stored in both an online store (DynamoDB, for low-latency inference) and an offline store (S3 with Glue catalog, for training).

Feature stores built from user data are among the most sensitive ML artifacts from a GDPR perspective:

# Feature group definition — what's stored in SageMaker Feature Store
feature_group_config = {
    "FeatureGroupName": "user-purchase-behavior-features",
    "RecordIdentifierFeatureName": "user_id",
    "EventTimeFeatureName": "event_time",
    "FeatureDefinitions": [
        {"FeatureName": "user_id", "FeatureType": "String"},
        {"FeatureName": "purchase_count_30d", "FeatureType": "Integral"},
        {"FeatureName": "avg_order_value_eur", "FeatureType": "Fractional"},
        {"FeatureName": "preferred_category", "FeatureType": "String"},
        {"FeatureName": "churn_risk_score", "FeatureType": "Fractional"},
        {"FeatureName": "gdpr_consent_marketing", "FeatureType": "String"},
        {"FeatureName": "lifetime_value_tier", "FeatureType": "String"},
    ]
}

This feature group stores a churn_risk_score and lifetime_value_tier — both are inferred personal data derived from behavioral analysis. The gdpr_consent_marketing field reveals that consent status is stored as a feature, making the feature store effectively a record of user consent by user ID. Under GDPR, this feature table is personal data, and it lives in SageMaker Feature Store under US jurisdiction.
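Because the feature table is personal data, it is also subject to Article 17 erasure requests. The sketch below shows one way to handle that, assuming user-keyed feature groups; the boto3 `delete_record` call on the `sagemaker-featurestore-runtime` client is real, but the surrounding function and its name are illustrative, and the client is injected so the flow can be exercised without AWS access.

```python
import datetime

def erase_user_features(client, feature_groups: list[str], user_id: str) -> int:
    """Issue a DeleteRecord for the user in each online feature group;
    returns the number of delete calls made. Note: this tombstones the
    online store only — offline (S3) copies typically need a separate
    Athena/Spark deletion job to fully satisfy an erasure request."""
    event_time = datetime.datetime.now(datetime.timezone.utc).isoformat()
    for fg in feature_groups:
        client.delete_record(
            FeatureGroupName=fg,
            RecordIdentifierValueAsString=user_id,
            EventTime=event_time,
        )
    return len(feature_groups)

# In production the client would be:
#   client = boto3.client("sagemaker-featurestore-runtime",
#                         region_name="eu-central-1")
```

The offline-store caveat in the docstring is the operationally painful part: erasure is only complete once the historical Parquet partitions are rewritten.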

Model Artifacts and the Model Registry

SageMaker stores model artifacts (the trained model files) in S3, but manages their lifecycle through the SageMaker Model Registry. The registry tracks model package versions, approval status, deployment history, and the lineage linking each model version back to the training job and data that produced it.

SageMaker Model Cards, introduced to support responsible AI documentation, create particularly comprehensive records:

{
  "ModelCardName": "user-churn-predictor-v4",
  "ModelCardStatus": "Approved",
  "Content": {
    "model_overview": {
      "model_description": "Predicts 30-day churn probability for EU consumer accounts",
      "model_creator": "ml-team@company.com",
      "training_datasets": ["s3://prod-data-lake/users/events/", "s3://prod-data-lake/purchases/"],
      "problem_type": "BinaryClassification"
    },
    "intended_uses": {
      "purpose_of_model": "Reduce EU customer churn via targeted retention offers",
      "intended_users": "CRM team, marketing automation",
      "out_of_scope_uses": "Hiring decisions, credit scoring"
    },
    "training_details": {
      "training_observations": 2847293,
      "evaluation_datasets": ["s3://prod-data-lake/users/holdout-eu-2026-q1/"]
    }
  }
}

The Model Card records that the model was trained on 2.8 million observations from EU consumer accounts. Under GDPR, a model trained specifically on EU personal data, with documentation explicitly describing that purpose, is a processing activity that requires a lawful basis, a DPIA for high-risk automated decision-making, and — when the model influences decisions about individuals — Article 22 compliance for automated processing.

SageMaker Studio Notebooks

SageMaker Studio provides Jupyter notebook environments for data scientists. These notebooks are stored in SageMaker's managed file system (backed by EFS) under your account. Notebooks frequently contain sample rows pulled from production data, exploratory analysis of user records, and hardcoded paths into the data lake.

Notebooks are development artifacts that accumulate over time and are rarely audited for data residency. A notebook containing a sample row of personal data for debugging purposes is technically a personal data store — under US jurisdiction in SageMaker Studio.
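Auditing notebooks for embedded personal data can be partially automated. The sketch below assumes exported `.ipynb` files on disk: it parses each notebook's JSON structure and flags cells whose source or outputs contain email-like strings — a cheap proxy, not a complete PII scanner. The function name and regex are illustrative.

```python
import json
import re

# Crude proxy for embedded personal data; extend with user-ID patterns,
# phone numbers, etc. as appropriate for your data model.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def scan_notebook(nb_json: dict) -> list[int]:
    """Return the indices of cells whose source or outputs contain an
    email-like string. nb_json is a parsed .ipynb document."""
    flagged = []
    for i, cell in enumerate(nb_json.get("cells", [])):
        blob = "".join(cell.get("source", []))
        for out in cell.get("outputs", []):
            blob += json.dumps(out)  # outputs hold rendered dataframes, text, etc.
        if EMAIL_RE.search(blob):
            flagged.append(i)
    return flagged

# Typical usage over an EFS export:
#   from pathlib import Path
#   for path in Path("studio-export").rglob("*.ipynb"):
#       hits = scan_notebook(json.loads(path.read_text()))
#       if hits:
#           print(path, "cells:", hits)
```

A scan like this belongs in CI for any repository that stores notebooks, so sample data is caught before it accumulates.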

SageMaker Pipelines

SageMaker Pipelines defines ML workflows as directed acyclic graphs (DAGs). A pipeline definition includes the processing and training steps, their input data locations, the container images they run, and the parameters passed to each step.

Pipeline execution records create a longitudinal audit trail of every ML workflow run — which data was used, which preprocessing steps were applied, which model version was produced. This is valuable for reproducibility. It is also a detailed record of your personal data processing activities stored under US jurisdiction.

The GDPR Risk Analysis

Are ML Models Personal Data?

The European Data Protection Board's guidance on machine learning is clear on one point: a model trained on personal data is not automatically personal data itself. A model typically cannot be used to reconstruct individual training records (though model inversion attacks demonstrate this is not always true).

However, several SageMaker artifacts clearly are personal data:

SageMaker Feature Store — if features are keyed by user ID, the feature table is personal data. This is not the output of processing personal data; it is the personal data itself, stored in a new form.

Experiment metadata — when it records cohort sizes, data sources, and evaluation metrics broken down by user segment, it creates records linking processing activities to identifiable user populations.

Model cards with training dataset references — the combination of a model description ("predicts churn for EU users"), training data location ("s3://prod-data-lake/users/"), and training observation count creates a processing record that must be handled under GDPR.

Notebooks with embedded sample data — any notebook containing rows from a dataset with user IDs or other identifiers is straightforwardly personal data.

Article 22: Automated Decision-Making

SageMaker endpoints serve model predictions. When those predictions influence decisions about individuals — credit offers, content shown, prices charged, support prioritization — Article 22 of GDPR applies.

Article 22 requires that individuals have the right not to be subject to solely automated decisions with significant effects, unless explicit consent is given or the decision is necessary for contract performance. SageMaker's architecture makes it easy to build automated pipelines that go directly from model inference to decision — with no human review step.
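The structural fix is a routing gate between inference and action. The sketch below is one possible shape, with hypothetical names throughout: predictions that would drive a decision with significant effects are sent to a human review queue instead of being acted on automatically.

```python
def route_decision(prediction: float, threshold: float, significant_effect: bool) -> str:
    """Return 'auto' when a decision may be fully automated, otherwise
    'human_review'. A production system would also persist the routing
    outcome for audit, and surface the Article 22 rights notice to the
    affected individual."""
    if significant_effect and prediction >= threshold:
        return "human_review"
    return "auto"

# High churn score driving a significant decision → human reviews first
print(route_decision(0.91, 0.80, significant_effect=True))   # → human_review
# Same score, low-impact decision (e.g. content ordering) → automated
print(route_decision(0.91, 0.80, significant_effect=False))  # → auto
```

The key design point is that "significant effect" is a property of the downstream action, not of the model — the same endpoint may need different routing depending on what consumes its predictions.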

The CLOUD Act risk compounds the Article 22 concern: under a government request, the models driving your automated decisions about EU individuals could be disclosed to US authorities, enabling analysis of your automated decision-making architecture without your knowledge.

Data Minimization and Retention

GDPR Article 5(1)(e) requires that personal data be kept no longer than necessary. SageMaker's default retention behavior is indefinite. Training job metadata, experiment logs, processing job records, and model registry history all accumulate without expiration.

For organizations with mature data retention policies, SageMaker creates a gap: personal data is deleted from the primary database on schedule, but experiment logs referencing that data — or feature store entries derived from it — remain in SageMaker indefinitely.
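SageMaker will not expire this metadata for you, but a scheduled cleanup job can. The sketch below assumes a retention policy expressed in days; the boto3 `sagemaker` client's `list_experiments` and `delete_experiment` calls are real, but note that a real job must first delete the trial components under each experiment, and the client is injected here so the logic can be tested offline.

```python
import datetime

def purge_old_experiments(client, max_age_days: int, now=None) -> list[str]:
    """Delete experiments older than the retention window; returns the
    names deleted. Simplified: ignores pagination and assumes trials
    and trial components have already been removed."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    cutoff = now - datetime.timedelta(days=max_age_days)
    deleted = []
    for exp in client.list_experiments()["ExperimentSummaries"]:
        if exp["CreationTime"] < cutoff:
            client.delete_experiment(ExperimentName=exp["ExperimentName"])
            deleted.append(exp["ExperimentName"])
    return deleted

# In production: client = boto3.client("sagemaker", region_name="eu-central-1")
```

Running this on the same schedule as your primary-database retention job closes the gap between database deletion and ML-metadata deletion.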

Cross-Border Transfer

SageMaker training jobs can be configured to run in eu-central-1 (Frankfurt) or eu-west-1 (Ireland). Running training in EU regions keeps compute local. It does not, however, change the jurisdictional picture:

The architecture creates a situation where training data stays in EU S3 buckets, training compute runs in EU regions, but the metadata and control plane that manages all of it operates under US jurisdiction.

CLOUD Act Risk Profile

A CLOUD Act compelled disclosure of your SageMaker account could deliver:

What is disclosed:

  1. Training and processing job metadata, including input data locations and environment configuration
  2. Experiment runs with parameters, metrics, and cohort identifiers
  3. Feature store tables keyed by user ID, both online and offline
  4. Model artifacts, registry entries, and model cards
  5. Studio notebooks, including any embedded sample data
  6. Pipeline definitions and execution history

What this enables:

  1. Understanding which personal data populations you use for ML training
  2. Identifying which decisions are driven by models trained on personal data
  3. Accessing feature tables that may be equivalent to database records for individual users
  4. Analyzing the decision-making architecture for Article 22 compliance assessment
  5. Identifying all data flows between your raw data lake and your ML models

For healthcare, insurance, financial services, and HR technology companies — sectors where ML decisions have significant effects on individuals — this risk profile is high.

EU-Sovereign ML Platform Alternatives

The open-source ML ecosystem has matured considerably. The core components of a SageMaker-equivalent stack can be assembled from EU-sovereign tools.

| Capability | AWS SageMaker | MLflow (self-hosted) | Kubeflow | Hopsworks | ClearML |
| --- | --- | --- | --- | --- | --- |
| Experiment tracking | SageMaker Experiments | MLflow Tracking | Katib (hyperparameter tuning) | Hopsworks Experiments | ClearML Experiments |
| Feature store | SageMaker Feature Store | n/a (external) | Feast | Hopsworks Feature Store | n/a (external) |
| Model registry | SageMaker Model Registry | MLflow Model Registry | Kubeflow Model Registry | Hopsworks Model Registry | ClearML Model Registry |
| Pipeline orchestration | SageMaker Pipelines | MLflow Projects | Kubeflow Pipelines | HSML Pipelines | ClearML Pipelines |
| Notebook environment | SageMaker Studio | JupyterHub (self-hosted) | Kubeflow Notebooks | JupyterHub | JupyterLab (self-hosted) |
| Model serving | SageMaker Endpoints | MLflow Models + BentoML | KServe | Hopsworks KServe | ClearML Serving |
| EU-sovereign? | No (AWS, US jurisdiction) | Yes (Apache 2.0, self-hosted) | Yes (CNCF, self-hosted) | Yes (EU HQ, open-source) | Yes (self-hosted option) |
| License | AWS proprietary | Apache 2.0 | Apache 2.0 | Apache 2.0 (AGPL for some components) | Apache 2.0 |

MLflow (Self-Hosted)

MLflow is the most widely adopted open-source ML experiment tracking platform. It covers experiment logging, model registry, and model serving in a single package. Self-hosted MLflow on EU infrastructure eliminates the US jurisdiction dependency.

EU-sovereignty characteristics: Apache 2.0 licensed, fully self-hostable, with a PostgreSQL (or MySQL) backend store and any S3-compatible artifact store. Every byte of experiment metadata and every model artifact lives on infrastructure you control.

import mlflow
import mlflow.sklearn

# Point to your EU-sovereign MLflow instance
mlflow.set_tracking_uri("https://mlflow.your-eu-infra.com")

with mlflow.start_run(run_name="churn-model-v5"):
    mlflow.log_params({
        "learning_rate": 0.001,
        "n_estimators": 200,
        "training_cohort": "eu-gdpr-consented"
    })
    
    model.fit(X_train, y_train)
    accuracy = model.score(X_test, y_test)
    
    mlflow.log_metric("accuracy", accuracy)
    mlflow.sklearn.log_model(model, "churn-classifier")
    
# All experiment data stored in your PostgreSQL + MinIO on EU infrastructure

MLflow's artifact store can be backed by any S3-compatible storage. Hosting MinIO on Hetzner (German company, Frankfurt datacenter) keeps all model artifacts and experiment logs in EU jurisdiction.

Kubeflow

Kubeflow is a CNCF project providing a complete ML platform on Kubernetes. It covers hyperparameter tuning (Katib), pipeline orchestration (Kubeflow Pipelines), notebook environments (Kubeflow Notebooks), and model serving (KServe).

EU-sovereignty characteristics: Apache 2.0, governed by the CNCF rather than a single vendor, and deployable on any Kubernetes cluster — including EU-hosted managed Kubernetes or bare metal. All platform state stays inside your cluster.

# Kubeflow Pipeline definition — stored in your EU Kubernetes cluster
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: churn-training-pipeline
spec:
  templates:
    - name: preprocess
      container:
        image: your-eu-registry.com/ml/preprocessor:v3
        env:
          - name: DATA_SOURCE
            value: "s3://eu-data-lake/users/events/"
    - name: train
      container:
        image: your-eu-registry.com/ml/trainer:v3
        env:
          - name: MODEL_OUTPUT
            value: "s3://eu-models/churn/v5/"

Kubeflow Pipelines stores execution history in MySQL (EU-hosted) with artifact references in MinIO. No data leaves your Kubernetes cluster.

Hopsworks

Hopsworks is an EU-headquartered company (Stockholm, Sweden) providing an open-source ML platform with a particularly strong feature store implementation. The Hopsworks Feature Store supports both online (RonDB) and offline (Hive/Parquet) serving — equivalent to SageMaker Feature Store's dual-store architecture.

EU-sovereignty characteristics: developed by an EU-headquartered company, open-source (Apache 2.0, with AGPL for some components), and available both as a managed cloud and fully self-hosted on EU infrastructure.

import hopsworks

# Connect to your self-hosted Hopsworks instance (EU infrastructure)
project = hopsworks.login(host="hopsworks.your-eu-infra.com")

fs = project.get_feature_store()

# Create feature group — stored in EU jurisdiction
user_behavior_fg = fs.create_feature_group(
    name="user_purchase_behavior",
    version=3,
    description="User purchase behavior features for recommendation models",
    primary_key=["user_id"],
    event_time="event_time",
    online_enabled=True,  # RonDB for low-latency serving
)

user_behavior_fg.insert(user_behavior_df)
# Feature data stored in your EU-hosted RonDB + Parquet/Hive

Hopsworks is the closest architectural equivalent to SageMaker Feature Store + SageMaker Experiments in a single open-source package.

Feast (Feature Store Only)

For teams that only need to replace SageMaker Feature Store, Feast — short for "Feature Store", an open-source project that originated at Gojek and is now maintained under the LF AI & Data foundation — provides a focused solution.

from feast import FeatureStore

# Point to your EU-sovereign Feast registry
store = FeatureStore(repo_path=".")

# Retrieve features for batch scoring — data stays in your EU infrastructure
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["user_behavior:purchase_count_30d", "user_behavior:churn_risk_score"],
).to_df()

Feast stores the feature registry as a file (git-trackable) and supports EU-sovereign offline stores (BigQuery EU region, Redshift eu-west, S3 in eu-central-1, or local Parquet).

DVC (Data Version Control)

For teams primarily concerned with training data versioning and lineage — rather than experiment tracking — DVC provides git-like versioning for datasets and models.

# DVC stores data on your EU-sovereign remote
dvc remote add -d myremote s3://eu-data-bucket/dvc-store \
  --config core.endpointurl https://your-eu-s3-compatible.com

# Track training dataset — pointer in git, data in your EU storage
dvc add data/training/user_events_2026_q1.parquet
git add data/training/user_events_2026_q1.parquet.dvc
git commit -m "Add Q1 2026 user events training data"

# Push data to EU-sovereign remote
dvc push

DVC maintains data and model lineage entirely through git (for metadata) and your chosen remote storage (for actual data). No SageMaker-equivalent control plane.

Migration Strategy

Phase 1: Experiment Tracking Migration (Lowest Risk)

The lowest-risk first step is migrating new experiments to MLflow while leaving existing SageMaker infrastructure in place:

  1. Deploy MLflow on EU infrastructure (PostgreSQL + MinIO on Hetzner or Scaleway)
  2. Update new training scripts to use MLflow tracking URI instead of SageMaker Experiments
  3. Run existing models through both tracking systems during a transition period
  4. Retire SageMaker Experiments once the MLflow instance is stable

This migration typically requires changing only the tracking calls in training scripts and adds no infrastructure risk to production model serving.
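Step 3's transition period can be handled with a small fan-out shim, sketched below with hypothetical names: anything exposing `log_param`/`log_metric` can be plugged in as a backend — an MLflow wrapper, a SageMaker Experiments wrapper, or a stub in tests.

```python
class DualTracker:
    """Fan out tracking calls to multiple backends during the migration
    window, so SageMaker Experiments and MLflow receive identical runs.
    Backends are duck-typed: any object with log_param/log_metric works."""

    def __init__(self, *backends):
        self.backends = backends

    def log_param(self, key, value):
        for backend in self.backends:
            backend.log_param(key, value)

    def log_metric(self, key, value):
        for backend in self.backends:
            backend.log_metric(key, value)

# Hypothetical wiring during the transition:
#   tracker = DualTracker(SageMakerExperimentsBackend(run), MlflowBackend())
#   tracker.log_param("learning_rate", 0.001)
```

Once the MLflow instance is validated, retiring SageMaker Experiments is a one-line change: drop its backend from the constructor.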

Phase 2: Feature Store Migration

Feature store migration is higher risk because it affects production inference:

  1. Stand up Hopsworks or Feast on EU infrastructure
  2. Replay historical feature computation to populate the new store
  3. Validate feature values match between SageMaker Feature Store and the new store
  4. Switch serving code to use the new feature store (online store migration)
  5. Update training pipelines to read from the new offline store
  6. Deprecate SageMaker Feature Store after a validation period

The critical step is validating feature value parity — any discrepancy can cause model performance degradation in production.
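The parity check itself can be a pure function, which keeps it easy to run against samples from both stores in CI. The sketch below compares one entity's features from the old and new stores, with a relative tolerance for floats; names are illustrative.

```python
import math

def feature_mismatches(old: dict, new: dict, rtol: float = 1e-6) -> list[str]:
    """Return the feature names whose values differ between the two
    stores — missing keys count as mismatches, floats are compared
    with a relative tolerance, everything else with equality."""
    mismatched = []
    for key in sorted(set(old) | set(new)):
        a, b = old.get(key), new.get(key)
        if isinstance(a, float) and isinstance(b, float):
            if not math.isclose(a, b, rel_tol=rtol):
                mismatched.append(key)
        elif a != b:
            mismatched.append(key)
    return mismatched

old = {"purchase_count_30d": 7, "avg_order_value_eur": 42.50}
new = {"purchase_count_30d": 7, "avg_order_value_eur": 42.5000001}
print(feature_mismatches(old, new))  # → []
```

Running this over a statistically meaningful sample of user IDs, for both the online and offline stores, is what turns "the values look right" into a defensible cutover criterion.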

Phase 3: Notebook Migration to JupyterHub

SageMaker Studio notebooks can be exported as standard Jupyter notebooks and run anywhere:

# Export SageMaker Studio notebooks
aws sagemaker list-user-profiles --domain-id-equals your-domain-id
# Download notebooks via SageMaker API or directly from EFS backup

# Stand up JupyterHub on EU Kubernetes
helm install jupyterhub jupyterhub/jupyterhub \
  --namespace jupyter \
  --values config.yaml  # points to your EU-sovereign storage

Compliance Checklist

Immediate (1 week):

  1. Inventory every SageMaker artifact type in use: training jobs, experiments, feature groups, model packages, notebooks, pipelines
  2. Identify feature groups keyed by user ID or containing inferred personal data
  3. Audit Studio notebooks for embedded sample data from production

Short-term (1 month):

  1. Define retention periods for experiment metadata, processing job records, and model registry history
  2. Confirm training jobs and training data are pinned to EU regions
  3. Deploy an EU-hosted MLflow instance and begin dual tracking for new experiments

Strategic (3 months):

  1. Migrate the feature store to Hopsworks or Feast on EU infrastructure
  2. Move notebook environments to self-hosted JupyterHub
  3. Migrate model serving to EU-sovereign endpoints and deprecate the SageMaker control plane

Running ML Workloads on sota.io

sota.io provides EU-sovereign compute for containerized ML workloads — model training, batch scoring, and model serving. Training jobs and inference endpoints deployed on sota.io run entirely within EU infrastructure with no US-jurisdiction control plane dependency.

ML pipelines built with MLflow or Kubeflow can deploy serving containers to sota.io with experiment metadata and model artifacts stored in your EU-sovereign MLflow or Hopsworks instance:

# Deploy EU-sovereign model serving endpoint on sota.io
sota deploy \
  --image your-eu-registry.com/models/churn-predictor:v5 \
  --env MLFLOW_TRACKING_URI=https://mlflow.your-eu-infra.com \
  --env MODEL_URI=mlflow://churn-model/production \
  --env FEATURE_STORE_URI=https://hopsworks.your-eu-infra.com

The model artifact, feature store, experiment history, and serving infrastructure all remain under EU jurisdiction. No SageMaker control plane dependency.

Summary

AWS SageMaker is the gravitational center of AWS-native ML workflows, and its gravity comes from accumulation: the longer you use it, the more training data references, experiment runs, feature store entries, model versions, and notebook code it holds. For teams training models on personal data — a description that covers most production recommendation, personalization, fraud detection, and churn prediction systems — that accumulation creates a GDPR exposure that grows with every training run.

The EU alternative stack is mature: MLflow for experiment tracking, Hopsworks or Feast for feature management, Kubeflow for pipeline orchestration, and KServe for model serving. Each component is individually production-proven and can be self-hosted on EU infrastructure. The challenge is integration — SageMaker's value is that these components work together out of the box. Building the equivalent stack requires integration work that SageMaker eliminates.

The question for each team is whether that integration cost is worth the compliance gain. For teams in healthcare, financial services, insurance, or any sector where automated ML decisions must meet high accountability standards, the answer increasingly is yes — not because SageMaker is uniquely risky, but because the combination of US jurisdiction, indefinite artifact retention, and deep integration with personal data processing creates a risk surface that is difficult to control without moving to a self-hosted stack.

EU-Native Hosting

Ready to move to EU-sovereign infrastructure?

sota.io is a German-hosted PaaS — no CLOUD Act exposure, no US jurisdiction, full GDPR compliance by design. Deploy your first app in minutes.