AWS SageMaker EU Alternative 2026: ML Training Data, GDPR Compliance, and CLOUD Act Risk
Post #718 in the sota.io EU Compliance Series
AWS SageMaker is the managed machine learning platform that most AWS-native teams reach for first: it handles data preparation, model training, experiment tracking, feature management, and model deployment as an integrated service. The appeal is real — SageMaker eliminates substantial infrastructure work that would otherwise fall on ML engineering teams.
What SageMaker also does, by necessity, is accumulate every artifact your ML pipeline produces. Training datasets. Model weights. Experiment logs with hyperparameters and metrics. Feature store tables built from your user data. Notebook code containing exploratory data analysis. Endpoint configurations revealing your serving architecture. All of this accumulates in your AWS account, under US jurisdiction, subject to CLOUD Act compelled disclosure.
For ML teams processing personal data — which describes a very large fraction of production ML systems — this creates a GDPR exposure that is qualitatively different from a database or an object store. The European Data Protection Board has issued guidance (Opinion 28/2024) making clear that models trained on personal data cannot automatically be treated as anonymous artifacts. The exposure is not just the training data. It is the models, the features, the experiments, and the infrastructure that processes them.
What AWS SageMaker Stores
SageMaker is not a single service. It is an integrated platform with distinct storage implications at each layer.
Training Data and Processing Jobs
SageMaker training jobs pull data from S3. While the data itself lives in S3 (a separate risk surface), SageMaker maintains the metadata for every processing job: input data locations, transformation code, output locations, container configurations, and the full execution log.
{
"ProcessingJobName": "user-behavior-feature-extraction-2026-04-15",
"ProcessingInputs": [
{
"InputName": "user-events",
"S3Input": {
"S3Uri": "s3://prod-data-lake/users/events/2026/",
"LocalPath": "/opt/ml/processing/input",
"S3DataType": "S3Prefix"
}
}
],
"ProcessingOutputConfig": {
"Outputs": [
{
"OutputName": "processed-features",
"S3Output": {
"S3Uri": "s3://prod-features/user-behavior-v3/",
"LocalPath": "/opt/ml/processing/output"
}
}
]
},
"Environment": {
"FEATURE_VERSION": "v3",
"USER_COHORT": "eu-gdpr-consented",
"INCLUDE_PII_FIELDS": "false"
}
}
The environment variables in this example reveal exactly what you are trying to avoid: the fact that you have EU GDPR-consented users as a distinct cohort, the feature version, and the negative confirmation that PII fields were excluded (implying they exist in the upstream data). SageMaker stores this metadata indefinitely under your account.
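You can see the scope of this metadata in your own account. The following is a minimal audit sketch using boto3; the region, the credentials, and the flagging heuristics (substring matches on "users" and "USER_COHORT") are assumptions to adapt, not a prescribed check.
import boto3

# Minimal audit sketch: enumerate processing jobs and flag those whose
# metadata references personal-data locations or cohort identifiers.
# Assumes boto3 credentials are configured; adjust the region as needed.
sm = boto3.client("sagemaker", region_name="eu-central-1")

flagged = []
paginator = sm.get_paginator("list_processing_jobs")
for page in paginator.paginate():
    for job in page["ProcessingJobSummaries"]:
        detail = sm.describe_processing_job(
            ProcessingJobName=job["ProcessingJobName"]
        )
        env = detail.get("Environment", {})
        inputs = [
            i["S3Input"]["S3Uri"]
            for i in detail.get("ProcessingInputs", [])
            if "S3Input" in i
        ]
        # Heuristic: flag jobs touching user data paths or GDPR cohort markers
        if any("users" in uri for uri in inputs) or "USER_COHORT" in env:
            flagged.append(job["ProcessingJobName"])

print(f"{len(flagged)} processing jobs reference user data paths or cohorts")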
SageMaker Experiments and Experiment Tracking
SageMaker Experiments is the MLflow-equivalent built into the platform. Every training run logs:
- Hyperparameters (learning rate, batch size, model architecture choices)
- Metrics (loss curves, accuracy, AUC — including per-class breakdowns)
- Input dataset references (S3 URIs of training, validation, and test sets)
- Output artifact locations (where trained model files were saved)
- Container and instance type used
- Start and end timestamps, run duration
# What SageMaker Experiments stores per training run
experiment_run = {
"RunName": "recommendation-model-v12-eu-cohort",
"Parameters": {
"learning_rate": 0.0003,
"batch_size": 512,
"num_epochs": 50,
"architecture": "two-tower-retrieval",
"training_users": 847293, # number of users in training set
"evaluation_cohort": "eu-gdpr-consented-active-30d",
},
"Metrics": {
"train_loss": [2.34, 1.87, 1.43, ...],
"val_ndcg_at_10": [0.23, 0.31, 0.38, ...],
"precision_at_5_eu_cohort": 0.44,
},
"Artifacts": {
"model_output": "s3://prod-models/recommendation/v12/model.tar.gz",
"training_data": "s3://prod-data-lake/users/interactions/2026-03/",
}
}
The training_users: 847293 parameter reveals the size of your personal data training set. The cohort name eu-gdpr-consented-active-30d reveals your data segmentation strategy. The precision metric precision_at_5_eu_cohort reveals that you evaluate recommendation quality specifically on EU users. None of this is what most engineers think of as "personal data" — but under GDPR's broad definition, experiment metadata describing how you process EU users' behavioral data creates a processing record that must be handled accordingly.
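To gauge how much cohort-level metadata has accumulated in your own account, a sketch like the following scans trial component parameters via boto3. The matching heuristics are illustrative; adapt them to your naming conventions.
import boto3

# Hedged sketch: scan experiment run parameters for cohort identifiers
# that tie training runs to EU user populations.
sm = boto3.client("sagemaker", region_name="eu-central-1")

paginator = sm.get_paginator("list_trial_components")
for page in paginator.paginate():
    for tc in page["TrialComponentSummaries"]:
        detail = sm.describe_trial_component(
            TrialComponentName=tc["TrialComponentName"]
        )
        for name, value in detail.get("Parameters", {}).items():
            text = value.get("StringValue", "")
            # Flag parameters that look like EU cohort references
            if "eu" in text.lower() or "cohort" in name.lower():
                print(tc["TrialComponentName"], name, text)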
SageMaker Feature Store
SageMaker Feature Store is designed to store pre-computed features for ML training and inference — tabular data derived from your raw inputs, stored for reuse across models and teams. Features are stored in both an online store (DynamoDB, for low-latency inference) and an offline store (S3 with Glue catalog, for training).
Feature stores built from user data are among the most sensitive ML artifacts from a GDPR perspective:
# Feature group definition — what's stored in SageMaker Feature Store
feature_group_config = {
"FeatureGroupName": "user-purchase-behavior-features",
"RecordIdentifierFeatureName": "user_id",
"EventTimeFeatureName": "event_time",
"FeatureDefinitions": [
{"FeatureName": "user_id", "FeatureType": "String"},
{"FeatureName": "purchase_count_30d", "FeatureType": "Integral"},
{"FeatureName": "avg_order_value_eur", "FeatureType": "Fractional"},
{"FeatureName": "preferred_category", "FeatureName": "String"},
{"FeatureName": "churn_risk_score", "FeatureType": "Fractional"},
{"FeatureName": "gdpr_consent_marketing", "FeatureName": "String"},
{"FeatureName": "lifetime_value_tier", "FeatureType": "String"},
]
}
This feature group stores a churn_risk_score and lifetime_value_tier — both are inferred personal data derived from behavioral analysis. The gdpr_consent_marketing field reveals that consent status is stored as a feature, making the feature store effectively a record of user consent by user ID. Under GDPR, this feature table is personal data, and it lives in SageMaker Feature Store under US jurisdiction.
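That also makes the feature store a direct target of Article 17 erasure requests. A minimal sketch of the deletion path, assuming the feature group above: note that delete_record only removes the online-store record, while the offline store in S3 is append-only and requires a separate cleanup job for historical rows.
import boto3
from datetime import datetime, timezone

# Sketch of an Article 17 erasure path for the feature group above.
# delete_record removes the online-store record; the offline store (S3)
# is append-only, so historical rows need a separate compaction job.
runtime = boto3.client(
    "sagemaker-featurestore-runtime", region_name="eu-central-1"
)

runtime.delete_record(
    FeatureGroupName="user-purchase-behavior-features",
    RecordIdentifierValueAsString="user-12345",  # hypothetical user ID
    EventTime=datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
)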
Model Artifacts and the Model Registry
SageMaker stores model artifacts (the trained model files) in S3, but manages their lifecycle through the SageMaker Model Registry. The registry tracks:
- Model versions and their S3 artifact locations
- Training job lineage (which training job produced which model)
- Approval status (staging, production, rejected)
- Model cards with bias reports and evaluation results
- Custom metadata including model purpose descriptions
SageMaker Model Cards, introduced to support responsible AI documentation, create particularly comprehensive records:
{
"ModelCardName": "user-churn-predictor-v4",
"ModelCardStatus": "Approved",
"Content": {
"model_overview": {
"model_description": "Predicts 30-day churn probability for EU consumer accounts",
"model_creator": "ml-team@company.com",
"training_datasets": ["s3://prod-data-lake/users/events/", "s3://prod-data-lake/purchases/"],
"problem_type": "BinaryClassification"
},
"intended_uses": {
"purpose_of_model": "Reduce EU customer churn via targeted retention offers",
"intended_users": "CRM team, marketing automation",
"out_of_scope_uses": "Hiring decisions, credit scoring"
},
"training_details": {
"training_observations": 2847293,
"evaluation_datasets": ["s3://prod-data-lake/users/holdout-eu-2026-q1/"]
}
}
}
The Model Card records that the model was trained on 2.8 million observations from EU consumer accounts. Under GDPR, a model trained specifically on EU personal data, with documentation explicitly describing that purpose, is a processing activity that requires a lawful basis, a DPIA for high-risk automated decision-making, and — when the model influences decisions about individuals — Article 22 compliance for automated processing.
SageMaker Studio Notebooks
SageMaker Studio provides Jupyter notebook environments for data scientists. These notebooks are stored in SageMaker's managed file system (backed by EFS) under your account. Notebooks frequently contain:
- Exploratory data analysis with sample data rows (sometimes copied from production)
- Visualization code that renders individual user records
- Ad-hoc queries with hardcoded filters like WHERE user_id = '12345'
- Data validation code that prints statistical summaries of personal data
- Comments with context about data provenance ("this cohort is GDPR-consented EU users from Q1")
Notebooks are development artifacts that accumulate over time and are rarely audited for data residency. A notebook containing a sample row of personal data for debugging purposes is technically a personal data store — under US jurisdiction in SageMaker Studio.
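Auditing for this is tractable. A rough scanning sketch, using only the standard library, that flags notebooks whose cells or outputs look like they contain identifiers; the notebooks/ path and the regex heuristics are placeholders, deliberately crude.
import json
import re
from pathlib import Path

# Heuristic scan (illustrative, not exhaustive): flag notebooks whose
# code or outputs contain user-ID-like patterns or email addresses.
PATTERNS = [
    re.compile(r"user_id\s*=\s*['\"]?\d+"),
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),  # crude email matcher
]

for nb_path in Path("notebooks").rglob("*.ipynb"):
    nb = json.loads(nb_path.read_text())
    for cell in nb.get("cells", []):
        text = "".join(cell.get("source", []))
        for out in cell.get("outputs", []):
            text += "".join(out.get("text", []))
        if any(p.search(text) for p in PATTERNS):
            print(f"possible personal data in {nb_path}")
            break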
SageMaker Pipelines
SageMaker Pipelines defines ML workflows as directed acyclic graphs (DAGs). A pipeline definition includes:
- Step definitions with input/output specifications
- Parameter values (including data paths and cohort filters)
- Step dependencies
- Execution history with full provenance
Pipeline execution records create a longitudinal audit trail of every ML workflow run — which data was used, which preprocessing steps were applied, which model version was produced. This is valuable for reproducibility. It is also a detailed record of your personal data processing activities stored under US jurisdiction.
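The scope of that record is easy to underestimate. A short boto3 sketch that walks every pipeline, execution, and step your account retains (region and credentials assumed):
import boto3

# Sketch: enumerate the provenance trail SageMaker Pipelines retains.
# Every record listed below sits under the US-jurisdiction control plane.
sm = boto3.client("sagemaker", region_name="eu-central-1")

for pipeline in sm.list_pipelines()["PipelineSummaries"]:
    name = pipeline["PipelineName"]
    executions = sm.list_pipeline_executions(PipelineName=name)
    for execution in executions["PipelineExecutionSummaries"]:
        steps = sm.list_pipeline_execution_steps(
            PipelineExecutionArn=execution["PipelineExecutionArn"]
        )
        print(name, execution["PipelineExecutionArn"],
              [s["StepName"] for s in steps["PipelineExecutionSteps"]])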
The GDPR Risk Analysis
Are ML Models Personal Data?
The European Data Protection Board's guidance on machine learning is nuanced on one point: a model trained on personal data is not automatically personal data itself, but neither can it be assumed anonymous. A model cannot typically be used to reconstruct individual records, though model inversion and membership inference attacks demonstrate this is not always true.
However, several SageMaker artifacts clearly are personal data:
SageMaker Feature Store — if features are keyed by user ID, the feature table is personal data. This is not the output of processing personal data; it is the personal data itself, stored in a new form.
Experiment metadata — when it records cohort sizes, data sources, and evaluation metrics broken down by user segment, it creates records linking processing activities to identifiable user populations.
Model cards with training dataset references — the combination of a model description ("predicts churn for EU users"), training data location ("s3://prod-data-lake/users/"), and training observation count creates a processing record that must be handled under GDPR.
Notebooks with embedded sample data — any notebook containing rows from a dataset with user IDs or other identifiers is straightforwardly personal data.
Article 22: Automated Decision-Making
SageMaker endpoints serve model predictions. When those predictions influence decisions about individuals — credit offers, content shown, prices charged, support prioritization — Article 22 of GDPR applies.
Article 22 gives individuals the right not to be subject to solely automated decisions with legal or similarly significant effects, unless explicit consent is given, the decision is necessary for contract performance, or it is authorised by EU or member state law. SageMaker's architecture makes it easy to build automated pipelines that go directly from model inference to decision — with no human review step.
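One structural mitigation is to ensure that decisions with significant effects are never solely automated. A minimal sketch of such a gate follows; the action names and queue mechanics are illustrative, not a prescribed pattern.
from dataclasses import dataclass

# Minimal Article 22 gate: decisions with significant effects are queued
# for human review instead of being applied automatically.
@dataclass
class Decision:
    user_id: str
    churn_score: float
    action: str

AUTOMATABLE = {"send_newsletter"}  # low-impact actions only (illustrative)

def apply_automatically(decision: Decision) -> None:
    print(f"auto-applied {decision.action} for {decision.user_id}")

def route(decision: Decision, review_queue: list) -> None:
    if decision.action in AUTOMATABLE:
        apply_automatically(decision)
    else:
        # Significant effect: a human reviews before anything is applied
        review_queue.append(decision)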
The CLOUD Act risk compounds the Article 22 concern: under a government request, the models driving your automated decisions about EU individuals could be disclosed to US authorities, enabling analysis of your automated decision-making architecture without your knowledge.
Data Minimization and Retention
GDPR Article 5(1)(e) requires that personal data be kept no longer than necessary. SageMaker's default retention behavior is indefinite. Training job metadata, experiment logs, processing job records, and model registry history all accumulate without expiration.
For organizations with mature data retention policies, SageMaker creates a gap: personal data is deleted from the primary database on schedule, but experiment logs referencing that data — or feature store entries derived from it — remain in SageMaker indefinitely.
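Closing that gap starts with finding the stale metadata. A sketch using boto3's CreationTimeBefore filter; the 365-day window is an example, and since SageMaker offers no bulk-expiry mechanism, remediation of whatever this surfaces is manual.
import boto3
from datetime import datetime, timedelta, timezone

# Sketch: find training-job metadata older than your retention window.
sm = boto3.client("sagemaker", region_name="eu-central-1")
cutoff = datetime.now(timezone.utc) - timedelta(days=365)

stale = []
paginator = sm.get_paginator("list_training_jobs")
for page in paginator.paginate(CreationTimeBefore=cutoff):
    stale.extend(j["TrainingJobName"] for j in page["TrainingJobSummaries"])

print(f"{len(stale)} training jobs exceed the retention window")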
Cross-Border Transfer
SageMaker training jobs can be configured to run in eu-central-1 (Frankfurt) or eu-west-1 (Ireland). Running training in EU regions keeps compute local. However:
- SageMaker's control plane (the APIs you call to create training jobs, query experiments, manage the model registry) depends on AWS's global service layer, which is anchored in us-east-1
- SageMaker Studio configuration and user management runs through global AWS IAM
- CloudTrail logs of all SageMaker API calls may be stored in your preferred region but are accessible globally through the AWS control plane
The architecture creates a situation where training data stays in EU S3 buckets, training compute runs in EU regions, but the metadata and control plane that manages all of it operates under US jurisdiction.
CLOUD Act Risk Profile
A CLOUD Act compelled disclosure of your SageMaker account would be comprehensive.
What is disclosed:
- All training job definitions, including data source locations and model output paths
- Experiment logs with hyperparameters, metrics, and cohort identifiers for every training run
- Feature store tables containing pre-computed features keyed by user ID
- Model registry with training lineage, model cards, and approval history
- Pipeline definitions showing your full ML workflow architecture
- Notebook code, including any embedded data samples
What this enables:
- Understanding which personal data populations you use for ML training
- Identifying which decisions are driven by models trained on personal data
- Accessing feature tables that may be equivalent to database records for individual users
- Analyzing the decision-making architecture for Article 22 compliance assessment
- Identifying all data flows between your raw data lake and your ML models
For healthcare, insurance, financial services, and HR technology companies — sectors where ML decisions have significant effects on individuals — this risk profile is high.
EU-Sovereign ML Platform Alternatives
The open-source ML ecosystem has matured considerably. The core components of a SageMaker-equivalent stack can be assembled from EU-sovereign tools.
| Capability | AWS SageMaker | MLflow (Self-hosted) | Kubeflow | Hopsworks | ClearML |
|---|---|---|---|---|---|
| Experiment tracking | SageMaker Experiments | MLflow Tracking | Kubeflow Pipelines + Katib | Hopsworks Experiments | ClearML Experiments |
| Feature store | SageMaker Feature Store | n/a (external) | Feast | Hopsworks Feature Store | n/a (external) |
| Model registry | SageMaker Model Registry | MLflow Model Registry | Kubeflow Model Registry | Hopsworks Model Registry | ClearML Model Registry |
| Pipeline orchestration | SageMaker Pipelines | MLflow Projects | Kubeflow Pipelines | Airflow (bundled) | ClearML Pipelines |
| Notebook environment | SageMaker Studio | JupyterHub (self-hosted) | Kubeflow Notebooks | JupyterHub | JupyterLab (self-hosted) |
| Model serving | SageMaker Endpoints | MLflow Models + BentoML | KServe | KServe (integrated) | ClearML Serving |
| EU-sovereign? | No (AWS, US jurisdiction) | Yes (Apache 2.0, self-hosted) | Yes (CNCF, self-hosted) | Yes (EU HQ, open-source) | Yes (self-hosted option) |
| License | AWS proprietary | Apache 2.0 | Apache 2.0 | Apache 2.0 (AGPL for some components) | Apache 2.0 |
MLflow (Self-Hosted)
MLflow is the most widely adopted open-source ML experiment tracking platform. It covers experiment logging, model registry, and model serving in a single package. Self-hosted MLflow on EU infrastructure eliminates the US jurisdiction dependency.
EU-sovereignty characteristics:
- Backend stores: PostgreSQL or MySQL (EU-hosted), S3-compatible for artifact storage (MinIO on Hetzner, OVH Object Storage)
- No telemetry when self-hosted
- Apache 2.0 license — no vendor lock-in
- Compatible with any cloud provider or on-premises setup
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Point to your EU-sovereign MLflow instance
mlflow.set_tracking_uri("https://mlflow.your-eu-infra.com")

# Stand-in data; substitute your EU-resident churn training set
X, y = make_classification(n_samples=5000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = GradientBoostingClassifier(learning_rate=0.001, n_estimators=200)

with mlflow.start_run(run_name="churn-model-v5"):
    mlflow.log_params({
        "learning_rate": 0.001,
        "n_estimators": 200,
        "training_cohort": "eu-gdpr-consented"
    })
    model.fit(X_train, y_train)
    accuracy = model.score(X_test, y_test)
    mlflow.log_metric("accuracy", accuracy)
    mlflow.sklearn.log_model(model, "churn-classifier")

# All experiment data stored in your PostgreSQL + MinIO on EU infrastructure
MLflow's artifact store can be backed by any S3-compatible storage. Hosting MinIO on Hetzner (a German company with German datacenters) keeps all model artifacts and experiment logs in EU jurisdiction.
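On the client side, pointing MLflow's artifact uploads at MinIO is a matter of environment configuration. A sketch with placeholder endpoints and credentials:
import os
import mlflow

# MLFLOW_S3_ENDPOINT_URL redirects MLflow's artifact uploads from AWS S3
# to your MinIO endpoint; the hostname below is a placeholder.
os.environ["MLFLOW_S3_ENDPOINT_URL"] = "https://minio.your-eu-infra.com"
os.environ["AWS_ACCESS_KEY_ID"] = "minio-access-key"      # MinIO credentials,
os.environ["AWS_SECRET_ACCESS_KEY"] = "minio-secret-key"  # not AWS IAM keys

mlflow.set_tracking_uri("https://mlflow.your-eu-infra.com")
with mlflow.start_run():
    # Artifacts now land in MinIO in your EU datacenter
    mlflow.log_artifact("model_card.md")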
Kubeflow
Kubeflow is a CNCF project providing a complete ML platform on Kubernetes. It covers pipeline orchestration and run tracking (Kubeflow Pipelines), hyperparameter tuning (Katib), notebook environments, and model serving (KServe).
EU-sovereignty characteristics:
- All state lives in Kubernetes (etcd), MySQL, and MinIO, each of which can be EU-hosted
- Runs on any Kubernetes cluster (Hetzner K3s, OVH Managed Kubernetes, on-premises)
- CNCF governance, not a US commercial entity controlling the roadmap
- Integrated with EU-sovereign object storage for artifacts
# Kubeflow Pipeline definition, compiled to an Argo Workflow and stored
# in your EU Kubernetes cluster
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: churn-training-pipeline
spec:
  entrypoint: main
  templates:
    - name: main
      dag:
        tasks:
          - name: preprocess
            template: preprocess
          - name: train
            template: train
            dependencies: [preprocess]
    - name: preprocess
      container:
        image: your-eu-registry.com/ml/preprocessor:v3
        env:
          - name: DATA_SOURCE
            value: "s3://eu-data-lake/users/events/"
    - name: train
      container:
        image: your-eu-registry.com/ml/trainer:v3
        env:
          - name: MODEL_OUTPUT
            value: "s3://eu-models/churn/v5/"
Kubeflow Pipelines stores execution history in MySQL (EU-hosted) with artifact references in MinIO. No data leaves your Kubernetes cluster.
Hopsworks
Hopsworks is an EU-headquartered company (Stockholm, Sweden) providing an open-source ML platform with a particularly strong feature store implementation. The Hopsworks Feature Store supports both online (RonDB) and offline (Hive/Parquet) serving — equivalent to SageMaker Feature Store's dual-store architecture.
EU-sovereignty characteristics:
- EU headquarters and primary development team
- Open-source (Apache 2.0 for core, AGPL for some enterprise components)
- Self-hosted or SaaS on EU-region cloud providers
- Feature versioning, data validation, and lineage tracking built in
import hopsworks
# Connect to your self-hosted Hopsworks instance (EU infrastructure)
project = hopsworks.login(host="hopsworks.your-eu-infra.com")
fs = project.get_feature_store()
# Create feature group — stored in EU jurisdiction
user_behavior_fg = fs.create_feature_group(
name="user_purchase_behavior",
version=3,
description="User purchase behavior features for recommendation models",
primary_key=["user_id"],
event_time="event_time",
online_enabled=True, # RonDB for low-latency serving
)
# user_behavior_df: a pandas DataFrame with user_id, event_time, and feature columns
user_behavior_fg.insert(user_behavior_df)
# Feature data stored in your EU-hosted RonDB + Parquet/Hive
Hopsworks is the closest architectural equivalent to SageMaker Feature Store + SageMaker Experiments in a single open-source package.
Feast (Feature Store Only)
For teams that only need to replace SageMaker Feature Store, Feast, an open-source feature store with Tecton among its core maintainers, provides a focused solution.
import pandas as pd
from feast import FeatureStore

# Point to your EU-sovereign Feast registry
store = FeatureStore(repo_path=".")
# Entity dataframe: user IDs and timestamps to join features onto
entity_df = pd.DataFrame({
    "user_id": ["u-1001", "u-1002"],
    "event_timestamp": pd.to_datetime(["2026-04-01", "2026-04-01"]),
})
# Retrieve features for batch scoring — data stays in your EU infrastructure
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["user_behavior:purchase_count_30d", "user_behavior:churn_risk_score"],
).to_df()
Feast stores the feature registry as a file (git-trackable) and supports many offline store backends. For genuine EU sovereignty, back it with local Parquet files or an EU-hosted PostgreSQL instance; BigQuery EU regions and Redshift in eu-west keep data resident in the EU but remain under US jurisdiction.
DVC (Data Version Control)
For teams primarily concerned with training data versioning and lineage — rather than experiment tracking — DVC provides git-like versioning for datasets and models.
# DVC stores data on your EU-sovereign remote
dvc remote add -d myremote s3://eu-data-bucket/dvc-store
dvc remote modify myremote endpointurl https://your-eu-s3-compatible.com
# Track training dataset — pointer in git, data in your EU storage
dvc add data/training/user_events_2026_q1.parquet
git add data/training/user_events_2026_q1.parquet.dvc
git commit -m "Add Q1 2026 user events training data"
# Push data to EU-sovereign remote
dvc push
DVC maintains data and model lineage entirely through git (for metadata) and your chosen remote storage (for actual data). No SageMaker-equivalent control plane.
Migration Strategy
Phase 1: Experiment Tracking Migration (Lowest Risk)
The lowest-risk first step is migrating new experiments to MLflow while leaving existing SageMaker infrastructure in place:
- Deploy MLflow on EU infrastructure (PostgreSQL + MinIO on Hetzner or Scaleway)
- Update new training scripts to use MLflow tracking URI instead of SageMaker Experiments
- Run existing models through both tracking systems during a transition period
- Retire SageMaker Experiments once the MLflow instance is stable
This migration requires only changing two lines in training scripts and adds no infrastructure risk to production model serving.
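Concretely, the change looks roughly like this. The commented-out lines are a sketch assuming the SageMaker Python SDK's experiments Run API on the old path.
# Before: logging to SageMaker Experiments via the SageMaker Python SDK
# from sagemaker.experiments.run import Run
# with Run(experiment_name="churn", run_name="v5") as run:
#     run.log_parameter("learning_rate", 0.001)

# After: the same run logged to your EU-hosted MLflow instance
import mlflow

mlflow.set_tracking_uri("https://mlflow.your-eu-infra.com")  # line 1
with mlflow.start_run(run_name="v5"):                        # line 2
    mlflow.log_param("learning_rate", 0.001)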
Phase 2: Feature Store Migration
Feature store migration is higher risk because it affects production inference:
- Stand up Hopsworks or Feast on EU infrastructure
- Replay historical feature computation to populate the new store
- Validate feature values match between SageMaker Feature Store and the new store
- Switch serving code to use the new feature store (online store migration)
- Update training pipelines to read from the new offline store
- Deprecate SageMaker Feature Store after a validation period
The critical step is validating feature value parity — any discrepancy can cause model performance degradation in production.
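A parity check can be as simple as sampling users from both stores and comparing columns. A sketch follows; fetch_sagemaker_features and fetch_hopsworks_features are hypothetical helpers you would implement against each store's read API.
import numpy as np
import pandas as pd

def validate_parity(user_ids: list[str], rtol: float = 1e-6) -> pd.DataFrame:
    """Compare feature values between old and new stores for sampled users."""
    old = fetch_sagemaker_features(user_ids)   # hypothetical helper
    new = fetch_hopsworks_features(user_ids)   # hypothetical helper
    merged = old.merge(new, on="user_id", suffixes=("_old", "_new"))
    mismatches = []
    for col in [c[:-4] for c in merged.columns if c.endswith("_old")]:
        old_vals, new_vals = merged[f"{col}_old"], merged[f"{col}_new"]
        if pd.api.types.is_numeric_dtype(old_vals):
            close = np.isclose(old_vals, new_vals, rtol=rtol)
        else:
            close = (old_vals == new_vals).to_numpy()
        if not close.all():
            mismatches.append((col, int((~close).sum())))
    return pd.DataFrame(mismatches, columns=["feature", "mismatch_count"])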
Phase 3: Notebook Migration to JupyterHub
SageMaker Studio notebooks can be exported as standard Jupyter notebooks and run anywhere:
# Export SageMaker Studio notebooks
aws sagemaker list-user-profiles --domain-id-equals d-xxxxxxxxxxxx
# Download notebooks via SageMaker API or directly from EFS backup
# Stand up JupyterHub on EU Kubernetes
helm repo add jupyterhub https://jupyterhub.github.io/helm-chart/
helm install jupyterhub jupyterhub/jupyterhub \
  --namespace jupyter --create-namespace \
  --values config.yaml # points to your EU-sovereign storage
Compliance Checklist
Immediate (1 week):
- Audit SageMaker Feature Store for feature groups keyed by user ID or other personal identifiers
- Review SageMaker Experiments for runs containing EU user cohort parameters
- Identify SageMaker Studio notebooks containing embedded data samples
- Check SageMaker Model Cards for models trained on personal data (Article 22 relevance)
Short-term (1 month):
- Deploy MLflow on EU infrastructure as the migration target for experiment tracking
- Evaluate Hopsworks or Feast for feature store replacement
- Establish data minimization policy for experiment metadata (cohort size caps, no raw sample logging)
- Conduct DPIA for any SageMaker-based systems making automated decisions about EU individuals
Strategic (3 months):
- Complete experiment tracking migration to EU-sovereign MLflow
- Begin feature store migration — online store first (inference path), then offline (training path)
- Update ROPA (Records of Processing Activities) to reflect SageMaker dependency reduction
- Implement notebook data handling policy (no embedded production data in development notebooks)
Running ML Workloads on sota.io
sota.io provides EU-sovereign compute for containerized ML workloads — model training, batch scoring, and model serving. Training jobs and inference endpoints deployed on sota.io run entirely within EU infrastructure with no US-jurisdiction control plane dependency.
ML pipelines built with MLflow or Kubeflow can deploy serving containers to sota.io with experiment metadata and model artifacts stored in your EU-sovereign MLflow or Hopsworks instance:
# Deploy EU-sovereign model serving endpoint on sota.io
sota deploy \
--image your-eu-registry.com/models/churn-predictor:v5 \
--env MLFLOW_TRACKING_URI=https://mlflow.your-eu-infra.com \
--env MODEL_URI=mlflow://churn-model/production \
--env FEATURE_STORE_URI=https://hopsworks.your-eu-infra.com
The model artifact, feature store, experiment history, and serving infrastructure all remain under EU jurisdiction. No SageMaker control plane dependency.
Summary
AWS SageMaker is the gravitational center of AWS-native ML workflows, and its gravity comes from accumulation: the longer you use it, the more training data references, experiment runs, feature store entries, model versions, and notebook code it holds. For teams training models on personal data — a description that covers most production recommendation, personalization, fraud detection, and churn prediction systems — that accumulation creates a GDPR exposure that grows with every training run.
The EU alternative stack is mature: MLflow for experiment tracking, Hopsworks or Feast for feature management, Kubeflow for pipeline orchestration, and KServe for model serving. Each component is individually production-proven and can be self-hosted on EU infrastructure. The challenge is integration — SageMaker's value is that these components work together out of the box. Building the equivalent stack requires integration work that SageMaker eliminates.
The question for each team is whether that integration cost is worth the compliance gain. For teams in healthcare, financial services, insurance, or any sector where automated ML decisions must meet high accountability standards, the answer increasingly is yes — not because SageMaker is uniquely risky, but because the combination of US jurisdiction, indefinite artifact retention, and deep integration with personal data processing creates a risk surface that is difficult to control without moving to a self-hosted stack.
EU-Native Hosting
Ready to move to EU-sovereign infrastructure?
sota.io is a German-hosted PaaS — no CLOUD Act exposure, no US jurisdiction, full GDPR compliance by design. Deploy your first app in minutes.