AWS Ground Truth EU Alternative: GDPR-Compliant ML Data Labeling in 2026
Training a machine learning model starts with labeled data. If your model learns to classify medical images, detect faces, transcribe speech, or rank product recommendations, someone — or something — must annotate the raw data first. AWS Ground Truth automates much of that work. The problem is that it does so under US jurisdiction, through sub-processors that your GDPR obligations treat very differently from what your AWS contract implies.
This guide covers the specific GDPR and CLOUD Act risks Ground Truth introduces, the articles that apply, and the EU-native alternatives that let your ML pipeline stay within European data sovereignty.
What AWS Ground Truth Does
AWS Ground Truth (officially Amazon SageMaker Ground Truth) is a managed data labeling service. You upload an input dataset to S3, select a labeling task type (image classification, bounding box, semantic segmentation, text classification, video frame labeling, and others), choose a labeling workforce, and Ground Truth produces a labeled output dataset, optionally alongside a model trained through active learning to auto-label the remaining data.
There are three workforce options:
- Amazon Mechanical Turk — public crowdworkers, managed by Amazon.
- AWS Marketplace vendors — third-party annotation companies approved by AWS.
- Private workforce — your own employees or contractors, managed via Amazon Cognito.
For developers building GDPR-compliant ML products, the choice of workforce is not just an operational decision. It determines your processor chain under GDPR Article 28.
The GDPR Article 28 Problem
Any time you upload personal data to AWS Ground Truth for annotation, AWS becomes your data processor. That is manageable — you have an AWS DPA (Data Processing Addendum) in place through your standard service agreement.
The complication arises with the labeling workforce.
When you use Mechanical Turk, Amazon engages individual Turkers as workers on your labeling job. These are not AWS employees. Amazon acts as a sub-processor, and each Turker briefly accesses your data to complete the annotation task. Under GDPR Article 28(2), you must authorise any sub-processor in writing, and sub-processors are bound by the same data protection obligations as the primary processor. Amazon's DPA does authorise sub-processors, but the list is generic, not specific to Turkers — and your DPA with Amazon does not give you direct visibility into which jurisdictions Turkers operate from.
This matters because Mechanical Turk workers are distributed globally, including the United States. When a Turker in Ohio labels an image of your user's face, that image has been transferred outside the EU/EEA for processing. Chapter V of GDPR requires that such transfers happen under one of the lawful transfer mechanisms: an adequacy decision, Standard Contractual Clauses, or Binding Corporate Rules.
Mechanical Turk's contractual chain does not cleanly provide this. Amazon's DPA SCCs cover the AWS-to-customer relationship, not the customer-to-Turker chain that Ground Truth creates.
When you use a private workforce, you control the workers, so the sub-processor concern disappears. But all the infrastructure — dataset storage, labeling job metadata, model artifacts from active learning — still lives in your AWS account in the region you selected. If that region is eu-west-1 (Ireland), the data rests on EU servers. However, it remains under an AWS contract governed by US law and subject to CLOUD Act access requests.
CLOUD Act Exposure for Training Data
The CLOUD Act (Clarifying Lawful Overseas Use of Data Act, 2018) requires US-based cloud providers to disclose customer data to US law enforcement when served with a qualifying order — regardless of where the data is physically stored. AWS, as a US company, is subject to this obligation.
For most business data this is a low-probability risk. For ML training datasets, the calculus changes:
- Health and biometric data (images of skin conditions, X-rays, voice recordings, facial photos) labeled in Ground Truth is GDPR Article 9 special category data. Art.9 processing requires an explicit lawful basis narrower than the Article 6 bases that cover ordinary personal data. CLOUD Act disclosure of Art.9 data to US authorities is a transfer that likely lacks a valid GDPR transfer mechanism.
- Behavioral training data (click logs, NLP corpora built from customer-support transcripts, e-commerce purchase histories) can contain enough information to uniquely identify individuals. Disclosure under the CLOUD Act is a personal data breach under GDPR Article 4(12), which covers "unauthorised disclosure of, or access to, personal data".
- Model artifacts themselves can encode personal data. A face recognition model trained on faces of EU residents is itself a form of derived personal data — its weights carry information about the training set. AWS stores Ground Truth model artifacts in S3 under your account, but under AWS jurisdiction.
Article 9 Labeling Tasks and the Biometric Data Trap
Ground Truth includes built-in templates for image classification, video object detection, and pose estimation. If your input images contain faces and the pipeline exists to identify people, you are processing biometric data under GDPR Article 4(14): personal data resulting from specific technical processing relating to the physical, physiological or behavioural characteristics of a natural person, which allow or confirm unique identification. Under Recital 51, a photograph alone is not automatically biometric data; annotating faces to train a recognition model is precisely the "specific technical processing" that brings it within the definition.
Biometric data is a special category under Article 9(1). Processing it requires one of the Article 9(2) lawful bases — most commonly explicit consent (Art.9(2)(a)) or substantial public interest with appropriate safeguards (Art.9(2)(g)). Relying on implicit consent from general terms of service is insufficient.
More practically: if your Ground Truth labeling job asks Turkers to draw bounding boxes around faces, you are sharing biometric data with a sub-processor workforce without a clear Art.9 processing basis, without a DPIA under Art.35 where the volume qualifies, and without SCCs covering the transfer where Turkers sit in third countries.
The combination of Art.9 data, US sub-processors, and no clear SCCs is exactly the pattern that European supervisory authorities have been citing in enforcement actions since 2023.
DPIA Requirement
Under GDPR Article 35(3)(b), processing on a large scale of special categories of data requires a Data Protection Impact Assessment before processing begins. Ground Truth jobs labeling medical images, voice data, or biometric photos on any production scale almost certainly cross the threshold.
The EDPB DPIA Template v1.0 (April 2026, consultation until 9 June 2026) provides a structured format. For a Ground Truth labeling pipeline, a DPIA must address at minimum:
- Description of processing: what data, at what scale, for what purpose
- Necessity and proportionality: why labeled training data requires the specific personal data involved
- Risk assessment: CLOUD Act access, third-country sub-processors, special category exposure
- Mitigation measures: encryption, pseudonymization, private workforce, EU-only infrastructure
- Residual risk and DPO consultation
The new EDPB template adds an explicit Section 7 on AI Act Article 26(9) — relevant if the ML model you are training qualifies as a high-risk AI system under Annex III. Labeling data for a biometric categorisation model (Annex III, point 1), an educational assessment system, or a credit scoring system triggers both the GDPR DPIA and the AI Act FRIA (Fundamental Rights Impact Assessment), which the template now integrates.
EU-Native Alternatives to AWS Ground Truth
The following tools let you run a complete ML data labeling pipeline on infrastructure you control, within EU jurisdiction, without Mechanical Turk in the processor chain.
Label Studio
Type: Open-source (Apache 2.0), self-hostable
Jurisdiction: Deploy on any EU infrastructure (Hetzner, OVH, Scaleway, sota.io)
Labeling types: Image, video, text, audio, time series, NLP, custom templates via JSON config
Key features: Active learning integration, multi-annotator agreement, OIDC/SAML SSO, REST API, Python SDK
GDPR posture: You control storage, compute, and labeler access. No third-party sub-processor unless you choose to involve one. Full Art.28 chain is within your organisation.
Label Studio is one of the most widely deployed open-source annotation platforms. The self-hosted version (Label Studio Community Edition) has no feature limits for basic labeling and integrates directly with ML model inference for active learning. Deploying it on a GPU-enabled EU VPS or on sota.io gives you the same functionality as Ground Truth's private workforce mode, with full infrastructure control.
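Creating projects programmatically keeps the labeling configuration in version control alongside your model code. The sketch below builds a Label Studio XML labeling config for an image-classification task using only the standard library; the commented SDK call, instance URL, API key, and project title are illustrative assumptions, not verified against a live instance.

```python
from xml.sax.saxutils import escape


def build_label_config(choices: list[str]) -> str:
    """Build a Label Studio XML labeling config for image classification."""
    options = "\n".join(f'    <Choice value="{escape(c)}"/>' for c in choices)
    return (
        "<View>\n"
        '  <Image name="image" value="$image"/>\n'
        '  <Choices name="label" toName="image">\n'
        f"{options}\n"
        "  </Choices>\n"
        "</View>"
    )


config = build_label_config(["benign", "malignant"])

# With the SDK installed (pip install label-studio-sdk), the project would
# be created against your self-hosted EU instance roughly like this:
#
#   from label_studio_sdk import Client
#   ls = Client(url="https://labels.example.eu", api_key="...")
#   ls.start_project(title="skin-lesion-triage", label_config=config)
```

The same config string can also be pasted into the Label Studio UI directly if you prefer not to script project creation.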
The commercial Label Studio Enterprise adds team management, advanced RBAC, and audit logging — relevant for Art.5(2) accountability under GDPR if you need to demonstrate who labeled which record and when.
CVAT (Computer Vision Annotation Tool)
Type: Open-source (MIT), self-hostable
Jurisdiction: Originally developed at Intel, now maintained by CVAT.ai — deployable on any EU infrastructure
Labeling types: Image, video, 3D point cloud, semantic segmentation, bounding box, polygon, polyline, skeleton tracking
Key features: Automatic annotation via AI models (YOLO, SAM, Grounding DINO), task queue management, webhook notifications, Nuclio serverless function integration for model-assisted labeling
GDPR posture: Self-hosted, no external sub-processors. CVAT's cloud version (app.cvat.ai) uses US infrastructure — use the self-hosted path for EU compliance.
CVAT is a leading tool for computer vision annotation at scale. Its model-assisted labeling (connecting a YOLO or Segment Anything model via Nuclio) reduces human annotation time significantly, delivering the same active-learning value proposition as Ground Truth without the CLOUD Act exposure.
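CVAT's webhook notifications can close the active-learning loop, for example by triggering a retraining job when an annotation task completes. Before acting on a payload, verify its signature. The snippet below is a generic HMAC-SHA256 check in the GitHub-style `sha256=<hex>` scheme; the exact header name and signing convention are assumptions to confirm against your CVAT version's webhook documentation.

```python
import hashlib
import hmac


def verify_signature(secret: str, body: bytes, signature_header: str) -> bool:
    """Recompute the HMAC-SHA256 of the raw request body and compare it
    in constant time against the signature supplied by the sender."""
    expected = "sha256=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_header)
```

Rejecting unsigned payloads also matters for Art.5(1)(f) integrity: an attacker who can forge webhook calls could otherwise trigger training on tampered data.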
Segments.ai
Type: Commercial SaaS
Jurisdiction: Belgium (EU-native, EU-incorporated entity, EU data hosting)
Labeling types: 2D/3D image segmentation, lidar point cloud, video
Key features: Managed annotation workforce (European), Python SDK, REST API, active learning workflows
GDPR posture: EU entity as data processor. No US-parent CLOUD Act exposure. Managed workforce with explicit sub-processor agreements under EU law.
Segments.ai is relevant when you need managed annotators but want to keep the entire processor chain within the EU. It positions directly as a GDPR-native Ground Truth alternative for computer vision and autonomous systems use cases.
Argilla (formerly Rubrix)
Type: Open-source (Apache 2.0), self-hostable
Jurisdiction: EU-origin (acquired by Hugging Face in 2024), deployable anywhere
Labeling types: NLP — text classification, token classification (NER), text generation feedback, question answering, LLM preference annotation (RLHF)
Key features: LLM fine-tuning dataset creation, RLHF feedback collection, Python SDK, Hugging Face Hub integration
GDPR posture: Self-hosted; data stays where you deploy it. Suited for LLM training data curation and human feedback collection that involves personal data in text form.
Argilla is the go-to open-source tool when your labeling task is NLP or LLM-related rather than computer vision. If you're building a text classifier on customer communications, or collecting preference feedback for RLHF from internal annotators, Argilla replaces both Ground Truth and the SageMaker RLHF workflow without leaving EU infrastructure.
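Pseudonymizing transcripts before they reach annotators (Art.25 data protection by design) shrinks the personal data exposed in the labeling step. A minimal standard-library sketch, with illustrative regex patterns that will need tuning for your data:

```python
import re

# Illustrative patterns only; real transcripts need locale-aware rules
# (names, addresses, IBANs) and ideally an NER-based second pass.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s/-]{7,}\d")


def redact(text: str) -> str:
    """Replace direct identifiers with placeholder tokens before the
    text is loaded into the annotation tool."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text
```

Run the redaction in the same pipeline step that pushes records into Argilla, so no unredacted copy ever lands in the annotation store.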
Deploying on EU Infrastructure
All four tools above are deployable as Docker containers on standard EU VPS infrastructure. The operational pattern:
- Provision a VPS in an EU data center — Frankfurt, Amsterdam, Paris, and Warsaw all sit under EU jurisdiction. Note that the CLOUD Act attaches to the provider, not the location, so choose an EU-headquartered provider without a US parent. Hetzner Cloud, Scaleway, OVH, and sota.io all qualify.
- Deploy Label Studio or CVAT via Docker Compose — both projects ship official Compose files. Label Studio needs roughly 2 GB RAM and a persistent volume for dataset storage. CVAT needs more: 4 GB RAM minimum, plus Redis and PostgreSQL services for production use.
- Mount training data from an EU object store — MinIO on the same host, or an S3-compatible bucket from an EU-native provider (Scaleway Object Storage, Hetzner Object Storage) as the data source. No data leaves the EU.
- Configure private labeler access — OIDC integration with your existing identity provider keeps the labeler roster in your Art.30 records without creating a new sub-processor.
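As a concrete starting point, a minimal Compose file for Label Studio might look like the sketch below. The image name and port follow the project's published Docker setup but should be verified against the official Compose file; the hostname is a placeholder.

```yaml
# Minimal sketch: self-hosted Label Studio on an EU VPS.
services:
  label-studio:
    image: heartexlabs/label-studio:latest
    ports:
      - "8080:8080"
    volumes:
      # Persistent EU-local storage for projects and uploaded datasets
      - ./data:/label-studio/data
    environment:
      # Placeholder external URL, served behind your TLS reverse proxy
      - LABEL_STUDIO_HOST=https://labels.example.eu
```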
For teams that prefer a managed path without managing the annotation infrastructure themselves: sota.io deploys Docker Compose stacks directly from a git push. You can version-control your Label Studio or CVAT configuration alongside your ML model code and let the platform handle the deployment.
Compliance Checklist: Ground Truth Replacement
Before going into production with an EU-native labeling pipeline:
- Art.28 DPA in place with your annotation tool vendor (Label Studio Enterprise / Segments.ai) or confirm self-hosted path has no external processors
- Sub-processor list updated — remove AWS as data processor for labeling; add EU infrastructure provider
- DPIA completed if processing Art.9 data (biometric, health, or other special categories) at scale
- Record of Processing updated (Art.30) — new processing purpose: "ML training data annotation"
- SCCs not needed — EU-hosted labelers within EU jurisdiction; verify no Turk-equivalent third-country transfer
- AI Act Art.26(9) assessment if the model trained on this data qualifies as a high-risk AI system under Annex III
- Pseudonymization before export to labelers if personally identifiable detail is not required for the annotation task (e.g., blur faces in bounding-box-only tasks)
- Retention policy for labeled datasets — GDPR Art.5(1)(e) storage limitation applies to training data that contains personal data
- Active learning model artifacts stored in EU — if using CVAT or Label Studio active learning, ensure model weights are written to EU storage, not S3/SageMaker
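The retention item is the one most often left unenforced. A small sketch of an automated check follows, where the dataset categories and retention windows are illustrative assumptions, not prescribed values:

```python
from datetime import date, timedelta

# Illustrative retention windows; set these from your own Art.30 records.
RETENTION = {
    "raw_training_images": timedelta(days=365),
    "labeled_exports": timedelta(days=730),
}


def expired(category: str, created: date, today: date) -> bool:
    """Art.5(1)(e) storage limitation: flag datasets past their window."""
    return today - created > RETENTION[category]
```

Wiring this into a scheduled job that deletes or re-pseudonymizes expired exports turns the checklist line into an enforceable control.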
Summary
AWS Ground Truth is a capable managed labeling service, but it introduces a processor chain that is difficult to reconcile with GDPR's Article 28, Chapter V, and Article 9 requirements when your training data involves personal data — and most production ML training data does.
The open-source alternatives — Label Studio for general tasks, CVAT for computer vision, Argilla for NLP/LLM — provide equivalent functionality with full infrastructure control when deployed on EU-native compute. Segments.ai covers the managed-workforce use case for teams that need annotators but want to stay within EU processor chains.
The compliance argument is not theoretical. German and Dutch DPAs have both indicated that the processor chain for AI training data is an active enforcement focus for 2026. Getting your labeling pipeline onto EU-sovereign infrastructure before your next model training run is the practical step.
EU-Native Hosting
Ready to move to EU-sovereign infrastructure?
sota.io is a German-hosted PaaS — no CLOUD Act exposure, no US jurisdiction, full GDPR compliance by design. Deploy your first app in minutes.