2026-04-30 · 12 min read

AWS Translate EU Alternative 2026: Machine Translation, Medical Text, and the GDPR Problem

Post #735 in the sota.io EU Compliance Series

AWS Translate is Amazon's neural machine translation service. It translates text and documents across more than 75 languages and language variants in real time — without any model training or ML infrastructure required. For teams handling multilingual content at scale — patient communication platforms, international e-commerce, legal translation workflows, HR documentation, customer support systems, or multilingual compliance portals — Translate offers an attractive combination of quality, throughput, and operational simplicity.

That simplicity obscures a GDPR complexity that is easy to miss. Translation workloads are rarely discussed in the same security review conversations as databases or identity systems, but the content sent through a translation API is often the most sensitive text in an organisation: medical diagnoses translated for patients who speak a different language from their care team, legal submissions translated for cross-border proceedings, HR records translated for multinational employees, and internal financial documents translated for international stakeholders. When AWS Translate processes this content, it is processing personal data — and frequently, special category personal data under Article 9 GDPR — through an API operated by Amazon Web Services, Inc., a Delaware corporation headquartered in Seattle, Washington.

The CLOUD Act (18 U.S.C. § 2713) gives US federal agencies the power to compel production of data held by US-controlled cloud providers regardless of where that data is physically stored. This means that choosing an AWS EU region for Translate does not resolve the jurisdiction problem — the data remains under US legal reach as long as AWS controls the infrastructure.

This analysis covers six GDPR exposure points in AWS Translate that European development teams need to understand before sending sensitive content through the API.

What AWS Translate Actually Does

At its core, Translate accepts text via the TranslateText API and returns a translated string. For larger workloads, StartTextTranslationJob processes documents asynchronously with source and target files in Amazon S3. Beyond basic translation, Translate offers several advanced features that create additional data surfaces.

Custom terminology lets you upload domain-specific glossaries — product names, medical terms, legal concepts — that override the model's default translations for specific tokens. These files are uploaded and stored in your AWS account via ImportTerminology and persist indefinitely until explicitly deleted. Parallel data is a more powerful customisation feature: you upload sets of bilingual document pairs to train a customised translation output style adapted to your organisation's domain and tone. Both features require uploading your content to AWS-controlled infrastructure.

Document translation extends the API beyond plain text to handle Word documents, PowerPoint presentations, and HTML files while preserving formatting. Documents are either provided inline or loaded from S3, processed on AWS infrastructure, and results returned synchronously or stored back in S3. Every document entering this pipeline is subject to US jurisdiction for the duration of processing and any retention period AWS applies.

Exposure Point 1: Article 9 Medical Translation

The most legally significant AWS Translate use case in European healthcare is patient-facing translation: translating discharge summaries, medication instructions, diagnosis letters, and care plans for patients whose primary language differs from the clinical language of the treating institution.

Under GDPR Article 9, health data is a special category requiring either explicit consent (Article 9(2)(a)) or a specific legal basis such as healthcare provision (Article 9(2)(h)). This elevated protection applies regardless of the format — a diagnosis letter translated from German to Turkish is still health data throughout the translation process.

When a hospital or health-tech platform sends patient letters through TranslateText, the full medical content of that letter — diagnoses, medications, treatment plans, prognoses — is transmitted to and processed by AWS infrastructure under US jurisdiction. The cloud provider becomes a data processor under Article 28, and the data processor's country of incorporation determines the applicable law for compelled disclosure. For AWS, that law is US federal law including the CLOUD Act.

European health systems operating under sector-specific regulations (German Krankenhaus-Zukunftsgesetz, French HDS certification requirements, UK NHS Data Security and Protection Toolkit) face a specific problem: many of these frameworks require health data to remain under EU-controlled processing chains. Using AWS Translate for patient-facing translation creates a US-processor dependency that may be incompatible with these requirements regardless of AWS's GDPR commitments.

Exposure Point 2: Legal Document Translation and Privilege

Legal service providers (law firms, in-house legal departments, legal-tech platforms) increasingly use machine translation to manage cross-border work: translating contracts for review, court submissions for international proceedings, regulatory filings for cross-border enforcement cases, and internal legal memos for multinational teams.

Legal documents frequently contain content subject to attorney-client privilege or legal professional privilege. These protections exist specifically to prevent compelled disclosure of confidential legal communications. The CLOUD Act creates a tension with these protections that has not been resolved: US authorities can issue a warrant to AWS for data stored in its infrastructure, including the translated content of documents that carry legal privilege under EU law.

The risk is not theoretical. StartTextTranslationJob processes entire documents — not just the structural or non-privileged content. A contract sent through batch translation may include negotiation positions, legal strategy memos embedded as comments, and privileged legal opinions that have been incorporated into the document body. The entire document, including its privileged content, transits AWS infrastructure under US legal reach.

EU legal services regulation is moving toward stricter data sovereignty requirements for legal data. The European Parliament's AI Act discussions include considerations for legal AI tools, and several EU bar associations have issued guidance recommending against use of US-hosted cloud services for privileged client communications.

Exposure Point 3: Custom Terminology as Persistent Data Upload

Custom terminology files uploaded via ImportTerminology create a persistent data asset in AWS-controlled infrastructure. These files are typically CSV or TMX format documents containing source terms and their preferred target translations across one or more language pairs.

For organisations in regulated industries, custom terminology files are rarely generic. A pharmaceutical company's custom terminology contains proprietary drug names and clinical trial vocabulary. A law firm's glossary contains client-specific legal terms that may reveal the nature of matters under representation. A financial institution's terminology contains internal account classifications, product codes, and regulatory category mappings that constitute non-public business information.

Under GDPR Article 5(1)(e), personal data must not be stored longer than necessary for the purpose for which it was processed. AWS's data retention documentation for custom terminologies does not clearly specify retention periods for terminology files after the customer account is closed or the terminology is deleted. The burden of verifying complete deletion rests with the data controller — the organisation using Translate — but the controller has no technical means to independently verify that AWS has deleted the data from all backup systems.

The sub-processor chain is also relevant here. AWS processes custom terminology data using its infrastructure, which includes shared services and potentially other AWS-controlled entities. Each entity in this chain must be identified in the Article 30 Record of Processing Activities, and data subjects have a right to be informed about sub-processors under Articles 13 and 14. For terminology files that contain personal data — names, job titles, or identifiers used as terminology context — these obligations apply.
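Because terminology files and parallel data persist until they are explicitly deleted, a controller may want a repeatable way to inventory and remove them. The following is a minimal sketch, assuming a boto3 Translate client is passed in (for example `boto3.client("translate")`); the helper names `list_customisation_assets` and `delete_customisation_assets` are hypothetical.

```python
from typing import Any, Dict, List


def _collect_names(list_call: Any, list_key: str) -> List[str]:
    """Follow NextToken pagination and collect the Name of every item."""
    names: List[str] = []
    kwargs: Dict[str, Any] = {}
    while True:
        page = list_call(**kwargs)
        names.extend(item["Name"] for item in page.get(list_key, []))
        token = page.get("NextToken")
        if not token:
            return names
        kwargs = {"NextToken": token}


def list_customisation_assets(translate_client: Any) -> Dict[str, List[str]]:
    """Inventory custom terminologies and parallel data in the account/region."""
    return {
        "terminologies": _collect_names(
            translate_client.list_terminologies, "TerminologyPropertiesList"
        ),
        "parallel_data": _collect_names(
            translate_client.list_parallel_data, "ParallelDataPropertiesList"
        ),
    }


def delete_customisation_assets(
    translate_client: Any, assets: Dict[str, List[str]]
) -> int:
    """Request deletion of every listed asset; returns the number of requests."""
    for name in assets["terminologies"]:
        translate_client.delete_terminology(Name=name)
    for name in assets["parallel_data"]:
        translate_client.delete_parallel_data(Name=name)
    return len(assets["terminologies"]) + len(assets["parallel_data"])
```

A successful delete call only confirms that the API accepted the request. It does not give the controller evidence about backup copies, which is the verification gap described above.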

Exposure Point 4: Parallel Data Training and Article 6 Lawful Basis

AWS Translate's parallel data feature allows organisations to upload bilingual document pairs to customise the model's translation style for their domain. These document pairs are stored in AWS-controlled infrastructure via the CreateParallelData API and used to adapt the base model output.

Parallel data documents are, by definition, full documents. Unlike custom terminology (which contains only specific terms), parallel data uploads include complete source and target texts from your existing document corpus. For organisations with GDPR-regulated content, this means uploading the original documents and their translations to AWS infrastructure for model adaptation purposes.

The GDPR Article 6 lawful basis question is: what is the legal basis for AWS processing these documents as training data for model customisation? AWS states that parallel data is used only for the customer's own customisation. Separately, AWS's service terms allow content processed by certain AI services, Translate among them, to be used for service improvement unless the customer opts out through an AWS Organizations AI services opt-out policy, so the default is opt-out rather than opt-in. In either case, the underlying model infrastructure that processes parallel data during the customisation job is operated by AWS, and the documents transit AWS-controlled compute infrastructure.

For organisations relying on Article 6(1)(b) (contract performance) or Article 6(1)(c) (legal obligation) as their lawful basis for translation, extending that processing to include parallel data upload for model customisation may require a separate lawful basis assessment. The customisation purpose is not inherent to the core translation service but is an additional processing activity requiring its own justification.

Exposure Point 5: Document Translation and S3 Dual-Storage Risk

The StartTextTranslationJob API for batch document translation creates a multi-stage data surface. Source documents must be stored in an S3 bucket that AWS Translate can access. The translation job reads documents from source S3, processes them through Translate's infrastructure, and writes results to a designated output S3 bucket.

This architecture creates what GDPR practitioners call a dual-retention problem. The source document exists in S3 before the job, the job processes it through Translate compute infrastructure, and the translated document exists in the output S3 bucket after the job. The controller is responsible for managing deletion from both S3 locations and ensuring Translate has not created intermediate copies during processing.

AWS's documentation for batch translation jobs does not specify the exact storage lifecycle of documents during job processing — whether intermediate copies are created, how long processing state is retained, and whether job metadata referencing document content is stored separately from the documents themselves. The data controller has no independent visibility into these implementation details.

Under Article 5(1)(e) storage limitation and Article 17 right to erasure, the controller must be able to ensure complete deletion of personal data. When documents transit a processing pipeline spanning multiple AWS services (S3 source bucket → Translate compute → S3 output bucket), with potential intermediate states that the controller cannot inspect, demonstrating complete erasure is technically challenging.
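The part of this pipeline the controller can manage directly is the two S3 locations. As a rough sketch, assuming a boto3 S3 client is passed in and that the job's input and output prefixes are known, a clean-up step after each batch job might look like this (`split_s3_uri` and `purge_prefix` are hypothetical helper names):

```python
from typing import Any, Dict, List, Tuple
from urllib.parse import urlparse


def split_s3_uri(s3_uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/prefix' into (bucket, prefix)."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {s3_uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def purge_prefix(s3_client: Any, s3_uri: str) -> List[str]:
    """Delete every object under the prefix and return the deleted keys.

    On a versioned bucket this only adds delete markers; older object
    versions have to be removed separately.
    """
    bucket, prefix = split_s3_uri(s3_uri)
    deleted: List[str] = []
    kwargs: Dict[str, Any] = {"Bucket": bucket, "Prefix": prefix}
    while True:
        page = s3_client.list_objects_v2(**kwargs)
        keys = [obj["Key"] for obj in page.get("Contents", [])]
        if keys:
            # A list page holds at most 1000 keys, the delete_objects limit.
            s3_client.delete_objects(
                Bucket=bucket,
                Delete={"Objects": [{"Key": key} for key in keys]},
            )
            deleted.extend(keys)
        if not page.get("IsTruncated"):
            return deleted
        kwargs["ContinuationToken"] = page["NextContinuationToken"]
```

Calling `purge_prefix` once for the input location and once for the output location covers both S3 copies. It says nothing about intermediate state inside the Translate service itself, which remains outside the controller's view.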

Exposure Point 6: Real-Time Translation in Customer-Facing Applications

TranslateText is the synchronous API used in real-time applications: customer support chat translation, live meeting transcription translation, and dynamic content localisation. Compared with batch processing, real-time translation has a different risk profile: the data volume per call is smaller, but the frequency is higher and the content is often more sensitive because it reflects live human communication.

Customer support chat translation is a common enterprise use case. Support agents communicate in one language; customers communicate in another; AWS Translate mediates the exchange in real time. The translated text is transmitted to and from AWS infrastructure for every message in the conversation. For a customer support channel handling medical device inquiries, financial services complaints, or healthcare consultations, the real-time translation pipeline processes a continuous stream of sensitive personal data through US-controlled infrastructure.

The transparency question is relevant here. Under GDPR Articles 13 and 14, data subjects must be told who receives their personal data and whether it is transferred to a third country. For translated conversations, that means disclosing that messages are passed to an external cloud service, who that provider is, and that the data is processed under US jurisdiction. Many customer support deployments that use real-time machine translation have not implemented these disclosure obligations. The translation layer is typically invisible to the customer: they see a response in their language without being informed that their message was transmitted to an external service provider for translation.

EU-Native Alternatives for GDPR-Compliant Machine Translation

LibreTranslate

LibreTranslate is a free and open-source machine translation API that can be deployed entirely on your own infrastructure. The server application is released under the AGPL-3.0 licence and supports 30+ languages via the Argos Translate engine. Its REST API is not a drop-in replacement for AWS Translate, but the translate endpoint is small (text, source language, target language), so migration usually needs only a thin adapter.

Self-hosting LibreTranslate on EU infrastructure via a platform like sota.io means your translation requests never leave your controlled environment. For organisations subject to Article 9 constraints on health data or legal privilege constraints on legal content, this eliminates the US jurisdiction exposure entirely. LibreTranslate supports custom language models, which can be added as .argosmodel packages without uploading training data to any third-party infrastructure.
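To make the request path concrete, here is a minimal sketch of a call to a self-hosted instance using only the Python standard library. It assumes the instance exposes LibreTranslate's `/translate` endpoint with the `q`, `source`, `target`, and `format` fields and returns a `translatedText` field; the base URL is a placeholder for your own deployment, and the function names are hypothetical.

```python
import json
import urllib.request
from typing import Optional


def build_translate_request(
    base_url: str,
    text: str,
    source: str,
    target: str,
    api_key: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /translate endpoint."""
    payload = {"q": text, "source": source, "target": target, "format": "text"}
    if api_key:
        # Only needed if the instance was started with API keys enabled.
        payload["api_key"] = api_key
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/translate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def translate(
    base_url: str, text: str, source: str, target: str, timeout: float = 10.0
) -> str:
    """Send the request to the self-hosted instance and return the translation."""
    request = build_translate_request(base_url, text, source, target)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["translatedText"]
```

With an instance listening on an internal address, `translate("http://localhost:5000", "Guten Morgen", "de", "en")` would send the text only to that host.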

OPUS-MT and Helsinki-NLP Models

The University of Helsinki's Language Technology group maintains OPUS-MT, a collection of open neural machine translation models trained on the OPUS parallel corpus. These models are available on Hugging Face under the CC-BY 4.0 licence and cover a large number of language pairs with particular strength in European languages.

OPUS-MT models can be deployed using the Hugging Face transformers library with the MarianMT architecture. A typical deployment requires a GPU-enabled container for production throughput; for lower-volume workloads, CPU inference is feasible. Hosting these models on EU infrastructure provides complete data sovereignty — the models are weights files that run entirely in your container without network calls to external services.
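A minimal inference sketch with the transformers library follows. It assumes `transformers`, `torch`, and `sentencepiece` are installed, that the requested pair is published under the common `Helsinki-NLP/opus-mt-<src>-<tgt>` naming scheme (not every pair is), and that the weights were downloaded into the container image beforehand so that `local_files_only=True` can keep inference free of network calls.

```python
from typing import List


def opus_mt_model_name(source: str, target: str) -> str:
    """Model id under the usual OPUS-MT naming scheme (assumed to exist)."""
    return f"Helsinki-NLP/opus-mt-{source}-{target}"


def translate_batch(texts: List[str], source: str, target: str) -> List[str]:
    """Translate a batch of short texts with a locally cached MarianMT model."""
    # Imported lazily so the module loads even where transformers is absent.
    from transformers import MarianMTModel, MarianTokenizer

    name = opus_mt_model_name(source, target)
    # local_files_only=True fails fast instead of contacting the model hub.
    tokenizer = MarianTokenizer.from_pretrained(name, local_files_only=True)
    model = MarianMTModel.from_pretrained(name, local_files_only=True)

    # Inputs longer than the model's maximum length are truncated here,
    # so long documents should be split before this call.
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```

In a real service the tokenizer and model would be loaded once at start-up rather than on every call; they are loaded inside the function here only to keep the sketch short.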

The Helsinki NLP group is an EU academic institution, and their model weights are published under permissive licences. This satisfies not just the technical data sovereignty requirement but also the supply-chain question: the model itself was created within the EU research ecosystem.

Argos Translate

Argos Translate is a Python library providing offline machine translation using OpenNMT-based models. It runs entirely locally without any network requirements, making it the highest-isolation option for sensitive content. The library is released under the MIT licence and packages translation models as .argosmodel files that can be distributed through controlled channels.

For applications that cannot afford any network exposure for sensitive content — offline clinical documentation systems, air-gapped legal environments, or on-premises HR platforms — Argos Translate provides machine translation as a pure local dependency. Quality is generally lower than cloud services or larger self-hosted models, but for many document types the quality is sufficient, particularly for European language pairs.
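As an illustration of the local-dependency model, the sketch below installs a model package from a file and translates in-process. It assumes the `argostranslate` package is installed and that a `.argosmodel` file for the language pair has been delivered through your own controlled channel; the `translate_fn` parameter is a hypothetical hook added here so the function can be exercised without the library.

```python
from pathlib import Path
from typing import Callable, Iterable, List, Optional


def install_model(model_path: str) -> None:
    """Install a .argosmodel package from local disk (no network access)."""
    import argostranslate.package

    argostranslate.package.install_from_path(Path(model_path))


def translate_offline(
    texts: Iterable[str],
    from_code: str,
    to_code: str,
    translate_fn: Optional[Callable[[str, str, str], str]] = None,
) -> List[str]:
    """Translate each text locally; defaults to Argos Translate's translate()."""
    if translate_fn is None:
        # Imported lazily so the module loads even where Argos is absent.
        import argostranslate.translate

        translate_fn = argostranslate.translate.translate
    return [translate_fn(text, from_code, to_code) for text in texts]
```

After `install_model("de_en.argosmodel")` (a placeholder filename), `translate_offline(["Guten Morgen"], "de", "en")` runs entirely inside the calling process.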

DeepL (EU Commercial Option)

DeepL SE (formerly DeepL GmbH) is a German company headquartered in Cologne. Unlike the open-source options above, DeepL is a commercial service with a paid API. However, as a German entity under EU jurisdiction, DeepL processes data under GDPR without the CLOUD Act exposure that applies to US providers.

DeepL offers GDPR-compliant data processing agreements and states that text submitted under its paid plans is not used to train its models; the free tier does not carry the same guarantee. For organisations that require commercial SLA guarantees, enterprise support, and higher translation quality than current open-source alternatives provide, DeepL is a credible EU-native alternative that eliminates US jurisdiction exposure while maintaining the operational simplicity of a managed API.

DeepL is not a complete substitute for AWS Translate's full feature set: it has no equivalent of S3-based batch translation jobs or the tight AWS ecosystem integration, and its glossary support works differently from AWS custom terminology. But for the core use case of translating text through an API, it provides comparable quality under EU jurisdiction.

Self-Hosted on sota.io

All of the above open-source options (LibreTranslate, OPUS-MT, and Argos Translate) can be containerised and deployed on sota.io, an EU-native PaaS that runs entirely on European infrastructure. This means no CLOUD Act exposure, no US sub-processors, and data processing that stays within the EU legal framework throughout the entire translation pipeline.

A typical LibreTranslate deployment on sota.io requires a single container with the LibreTranslate server, with optional GPU instance selection for higher throughput. The LibreTranslate REST API is small enough that most AWS Translate integrations can be pointed at it through a thin adapter, reducing migration effort.

Migration Considerations

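For real-time translation: Replace TranslateText calls with calls to a self-hosted LibreTranslate instance, or with in-process calls to the Argos Translate library. The API surface is similar: both accept source text and return translated text. LibreTranslate's documented API does not include a direct counterpart to AWS custom terminology, so glossary terms generally have to be enforced in a pre- or post-processing step.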

For document translation: OPUS-MT models accessed via the Hugging Face Transformers library provide a Python-native document translation pipeline. Documents should be chunked into sentences or paragraphs before translation; the sentence-splitter library can handle this for most European languages.
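The chunking step can be sketched in plain Python. Note that this example swaps in a simple punctuation-based regular expression as a stand-in for the sentence-splitter library mentioned above, so it will mis-split abbreviations and similar cases; `chunk_document` and the 400-character default are illustrative choices, not a recommendation.

```python
import re
from typing import List

# Naive sentence boundary: whitespace that follows '.', '!' or '?'.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def chunk_document(text: str, max_chars: int = 400) -> List[str]:
    """Split text into chunks that respect paragraph and sentence boundaries.

    Sentences are packed together until adding another would exceed
    max_chars. A single sentence longer than max_chars is kept whole.
    """
    chunks: List[str] = []
    for paragraph in re.split(r"\n\s*\n", text):
        paragraph = " ".join(paragraph.split())  # normalise whitespace
        if not paragraph:
            continue
        current = ""
        for sentence in _SENTENCE_END.split(paragraph):
            if current and len(current) + 1 + len(sentence) > max_chars:
                chunks.append(current)
                current = sentence
            else:
                current = f"{current} {sentence}".strip()
        if current:
            chunks.append(current)
    return chunks
```

Each chunk can then be passed to the translation model, and the translated chunks joined back together in their original order, with paragraph breaks restored where a chunk ended a paragraph.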

For parallel data customisation: Open-source fine-tuning of MarianMT models on your domain-specific parallel corpus is well-documented in the Hugging Face training tutorials. Fine-tuned model weights stay on your infrastructure and are never transmitted to a third-party provider.
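Before any fine-tuning, the existing bilingual corpus has to be put into a format the training tooling can read. The sketch below converts line-aligned source and target texts into JSON Lines records with a `translation` object keyed by language code, which is the layout the Hugging Face translation example script accepts as far as this editor is aware; check it against the tutorial version you follow. The function names are hypothetical.

```python
import json
from typing import Dict, List

Record = Dict[str, Dict[str, str]]


def to_translation_records(
    source_lines: List[str],
    target_lines: List[str],
    src_lang: str,
    tgt_lang: str,
) -> List[Record]:
    """Pair line-aligned texts and drop pairs where either side is empty."""
    if len(source_lines) != len(target_lines):
        raise ValueError("source and target must have the same number of lines")
    records: List[Record] = []
    for src, tgt in zip(source_lines, target_lines):
        src, tgt = src.strip(), tgt.strip()
        if src and tgt:
            records.append({"translation": {src_lang: src, tgt_lang: tgt}})
    return records


def write_jsonl(records: List[Record], path: str) -> None:
    """Write one JSON object per line, keeping non-ASCII characters readable."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Because the file is written to local disk, the corpus stays on the same infrastructure as the fine-tuning job.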

For integration with existing pipelines: LibreTranslate's API endpoint structure is close enough to AWS Translate that a thin adapter layer is sufficient for most integrations. The source, target, and q fields map directly to AWS Translate's SourceLanguageCode, TargetLanguageCode, and Text parameters.
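One possible shape for such an adapter is sketched below. It exposes a `translate_text` method with the same keyword arguments and response keys as the boto3 Translate client, and forwards the call to LibreTranslate's `/translate` endpoint (fields `q`, `source`, `target`). The class name, the injectable `transport` parameter, and the base URL are assumptions made for illustration; optional AWS parameters such as TerminologyNames are not handled.

```python
import json
import urllib.request
from typing import Any, Callable, Dict

Transport = Callable[[str, Dict[str, str]], Dict[str, Any]]


def _http_post_json(url: str, payload: Dict[str, str]) -> Dict[str, Any]:
    """Default transport: POST JSON and decode the JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


class LibreTranslateAdapter:
    """Mimics the call shape of boto3's translate_text for existing callers."""

    def __init__(self, base_url: str, transport: Transport = _http_post_json):
        self._url = base_url.rstrip("/") + "/translate"
        self._transport = transport

    def translate_text(
        self, *, Text: str, SourceLanguageCode: str, TargetLanguageCode: str
    ) -> Dict[str, str]:
        # Language codes are passed through unchanged; some codes may differ
        # between the two services and need a mapping table.
        payload = {
            "q": Text,
            "source": SourceLanguageCode,
            "target": TargetLanguageCode,
            "format": "text",
        }
        body = self._transport(self._url, payload)
        return {
            "TranslatedText": body["translatedText"],
            "SourceLanguageCode": SourceLanguageCode,
            "TargetLanguageCode": TargetLanguageCode,
        }
```

Existing code that calls `client.translate_text(Text=..., SourceLanguageCode=..., TargetLanguageCode=...)` and reads `TranslatedText` from the result could then be handed a `LibreTranslateAdapter("http://localhost:5000")` instead of the boto3 client.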

GDPR Documentation Requirements

Organisations replacing AWS Translate should update their Article 30 Records of Processing Activities to reflect the new processor configuration.

The elimination of AWS as a processor for translation workloads also simplifies the transfer mechanism documentation. For organisations that previously relied on Standard Contractual Clauses to legitimise transfers of Article 9 data to AWS (an approach of questionable legal certainty post-Schrems II), self-hosted EU translation removes the need for SCCs entirely for this processing activity.


Running machine translation under EU jurisdiction is straightforward with current open-source tooling. sota.io provides the EU-native infrastructure to host LibreTranslate, OPUS-MT, or any containerised translation service — without CLOUD Act exposure, without US sub-processors, and without the GDPR documentation complexity that comes with every AWS API call.
