2026-04-30·11 min read

AWS Polly EU Alternative 2026: Text-to-Speech, Medical Voice Data, and the GDPR Problem

Post #736 in the sota.io EU Compliance Series

AWS Polly is Amazon's text-to-speech service. It converts written text into lifelike spoken audio using neural voices across more than 60 languages and language variants. For product teams building voice interfaces, accessibility layers, audiobook pipelines, or patient-communication systems, Polly offers a compelling combination of voice quality, API simplicity, and broad language coverage — all without requiring any speech engineering expertise.

That convenience masks a GDPR exposure that is structurally similar to the one in AWS Translate, but with an additional dimension that most security reviews miss entirely: voice data. The text that Polly synthesises is frequently personal data. When that text is clinical content — medication reminders, discharge instructions, diagnostic findings — it is special category personal data under Article 9 GDPR. And when Polly's Neural TTS Custom Voice feature is used to create a synthetic voice trained on a speaker's recordings, those recordings and the resulting voice model meet the Article 4(14) definition of biometric data — also a special category.

AWS Polly is operated by Amazon Web Services, Inc., a Delaware corporation headquartered in Seattle, Washington. The CLOUD Act (18 U.S.C. § 2713) allows US federal agencies to compel production of data controlled by US cloud providers regardless of the AWS region selected. Choosing eu-central-1 or eu-west-1 for Polly does not change the jurisdiction analysis — it changes the physical location of the servers, not the legal reach of US law.

This guide covers six GDPR exposure points that European teams must understand before sending sensitive text through the Polly API.

What AWS Polly Actually Does

At its simplest, Polly accepts a text string via the SynthesizeSpeech API and returns an audio stream — MP3, OGG, or PCM — in real time. For longer content, StartSpeechSynthesisTask runs synthesis asynchronously and stores the audio output in an S3 bucket you control. Both flows require the full input text to pass through AWS-managed synthesis infrastructure.
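As a rough illustration of the synchronous flow, the sketch below assembles the parameters a boto3 `synthesize_speech` call would carry. The helper name `build_polly_request` and the choice of the voice `Vicki` are assumptions made for this example. The only point is that the complete input text travels inside the request.

```python
def build_polly_request(text: str, voice_id: str = "Vicki", ssml: bool = False) -> dict:
    """Assemble keyword arguments for boto3's polly.synthesize_speech().

    build_polly_request is a hypothetical helper used for illustration only.
    """
    return {
        "Text": text,                            # the full input, personal data included
        "TextType": "ssml" if ssml else "text",
        "OutputFormat": "mp3",
        "VoiceId": voice_id,                     # assumed voice for this example
        "Engine": "neural",
    }


request = build_polly_request("Ihre nächste Kontrolle ist am 15. Mai.")
# With AWS credentials configured, the call itself would look like:
#   import boto3
#   polly = boto3.client("polly", region_name="eu-central-1")
#   audio = polly.synthesize_speech(**request)["AudioStream"].read()
print(sorted(request))
```

Nothing in the request separates "content" from "metadata": whatever the text says about a person is what AWS receives.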

Beyond basic synthesis, Polly supports several advanced features that create additional data surfaces. SSML (Speech Synthesis Markup Language) lets callers embed prosody, pronunciation, pause, and emphasis controls directly in the input text, which is submitted as a structured XML document containing the full content to be spoken. Lexicons are custom pronunciation files — think brand names, medical acronyms, or specialist terminology — that can be uploaded to Polly via PutLexicon and reused across synthesis calls. Lexicons persist in your AWS account until explicitly deleted.

The most significant advanced feature from a data governance perspective is Custom Voices under Neural TTS, which AWS markets as Brand Voice and delivers as a custom engagement. This feature allows organisations to create a synthetic voice that sounds like a specific person (a corporate narrator, a brand voice) by training a model on recordings of that speaker. The voice recordings and the trained model are biometric representations of that individual's speech characteristics.

Exposure Point 1: Article 9 Medical Voice Synthesis

The most legally consequential use of Polly in European organisations is healthcare communication: synthesising discharge summaries, post-operative care instructions, medication reminders, and diagnostic results into audio for patients who need spoken output — whether due to visual impairment, low literacy, or language accessibility requirements.

The text entering the Polly API in these workflows is health data: special category personal data under GDPR Article 9. Article 9(1) prohibits processing of health data unless a specific legal basis in Article 9(2) applies. For healthcare providers, the relevant basis is typically Article 9(2)(h): processing necessary for the purposes of preventive or occupational medicine, medical diagnosis, or the provision of health or social care. But Article 9(2)(h) is not a blanket authorisation: under Article 9(3), it applies only to processing by or under the responsibility of a professional subject to the obligation of professional secrecy.

When a European hospital routes patient discharge summaries through AWS Polly, whether in us-east-1 or eu-central-1, those documents are processed on AWS infrastructure. AWS acts as the hospital's processor, or as a sub-processor where a software vendor sits in between. Server location alone does not settle the GDPR analysis: the relevant question is who controls and can access the data, and what legal instruments can compel disclosure.

The practical risk: US legal process served on AWS, for example a warrant under the Stored Communications Act, whose reach to data stored abroad the CLOUD Act confirms, could compel production of text that was submitted for synthesis, including patient identifiers, diagnoses, and treatment details. An accompanying nondisclosure order can prevent AWS from notifying the patient or the data controller. This is structurally incompatible with the fiduciary obligations that healthcare providers carry under both GDPR and sector-specific data protection frameworks like German Landeskrankenhausgesetze.

Exposure Point 2: Custom Voice Models as Article 9 Biometric Data

GDPR Article 4(14) defines biometric data as "personal data resulting from specific technical processing relating to the physical, physiological or behavioural characteristics of a natural person, which allow or confirm the unique identification of that natural person." Article 9(1) lists biometric data as a special category requiring elevated protection.

AWS Polly Neural TTS Custom Voice creates a voice model by processing recordings of a specific speaker. The recordings themselves are biometric data — they capture the speaker's unique physiological and behavioural voice characteristics. The trained model is also biometric data because it encodes a mathematical representation of the speaker's voice that enables identification and synthesis.

Organisations that build custom corporate or brand voices using Polly are creating Art.9 biometric records of their narrators. Processing these records requires an explicit legal basis under Article 9(2) — most commonly, explicit consent from the speaker under Article 9(2)(a). The consent must be specific, informed, unambiguous, and freely given. It must cover not just the recording session but also the transfer of recordings to AWS for model training, the ongoing storage of the model in AWS infrastructure, and any subsequent CLOUD Act disclosure risk.

The accountability gap: Many organisations that use Custom Voice features have not obtained biometric data processing agreements with AWS that satisfy Article 28 GDPR, and have not conducted the Data Protection Impact Assessment (DPIA) required under Article 35 for large-scale processing of biometric data. The CLOUD Act exposure for voice model data is particularly significant because the model, once exfiltrated, could be used to impersonate the speaker.

Exposure Point 3: SSML Documents as Structured Personal Data Submissions

SSML (Speech Synthesis Markup Language) is an XML format that gives callers fine-grained control over synthesis output — prosody, rate, pitch, emphasis, pauses, phonemes, and language switching. In practice, SSML documents submitted to Polly often contain more than just prosody markup: they embed the full text to be spoken, including names, titles, clinical values, and contextual narrative.

A healthcare SSML document might look like:

<speak>
  <p>Dear Maria Schneider,
  your discharge summary from <say-as interpret-as="date" format="dm">12/04</say-as>
  confirms a diagnosis of <lang xml:lang="de-DE">Typ-2-Diabetes</lang>.
  Your next appointment is on 
  <say-as interpret-as="date" format="dmy">15/05/2026</say-as>.</p>
</speak>

The full SSML document — including the patient name, diagnosis, and date — is transmitted to AWS as the synthesis input. SSML does not abstract or anonymise personal data: it embeds it. Every personal identifier in the text-to-be-spoken is present in the API call.
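To make this concrete, the following sketch strips the markup from an SSML document using only Python's standard library and prints what remains. The sample document and the helper name `spoken_text` are illustrative assumptions. The output is the text any TTS backend, cloud or local, receives along with the tags.

```python
import xml.etree.ElementTree as ET

# Illustrative sample, modelled on the healthcare example above
SSML = """<speak>
  <p>Dear Maria Schneider,
  your discharge summary from <say-as interpret-as="date" format="dm">12/04</say-as>
  confirms a diagnosis of <lang xml:lang="de-DE">Typ-2-Diabetes</lang>.</p>
</speak>"""


def spoken_text(ssml: str) -> str:
    """Return the plain text contained in an SSML document, tags removed."""
    root = ET.fromstring(ssml)
    # itertext() yields the text of every element in document order
    raw = "".join(root.itertext())
    # collapse line breaks and indentation into single spaces
    return " ".join(raw.split())


print(spoken_text(SSML))
```

A check like this can run in a pre-flight step so that reviewers see exactly which identifiers a given template would transmit.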

Article 5(1)(c) data minimisation: GDPR requires that personal data is "adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed." Submitting full patient records formatted as SSML to a cloud TTS API — when a self-hosted alternative could process the same content on-premises — is a data minimisation question that healthcare DPOs need to answer explicitly.

Exposure Point 4: Asynchronous Synthesis and S3 Dual-Storage

For content longer than the synchronous API limit, Polly's StartSpeechSynthesisTask processes synthesis asynchronously. The input text is supplied in the API call and the audio output is written to a specified S3 bucket. This creates a data flow with two retention surfaces:

Processing retention: The input text is held in AWS systems during synthesis. The duration and storage details are not explicitly disclosed in Polly's documentation.
Output retention: The synthesised audio file in S3 is a direct encoding of the spoken content. Deleting the audio satisfies the request to erase "the audio file" but not necessarily all traces of the underlying content in AWS infrastructure.

Under GDPR Article 17, data subjects have the right to erasure of personal data. For audio files synthesised from medical records: the audio is personal data (it encodes the patient's information in the spoken content), and potentially biometric data if the voice itself uniquely identifies the patient (unusual but possible with custom voice synthesis of patient-dictated content). Organisations must be able to demonstrate that all copies — input, intermediate, and output — have been erased. AWS's data deletion guarantees for Polly synthesis intermediates are not granular enough to satisfy Article 17 with confidence.

The Art.17 gap: S3 lifecycle policies can automatically expire the output audio, but the timing of intermediate data deletion within the Polly synthesis pipeline is not contractually guaranteed to align with Article 17 requests. Organisations that have issued erasure responses to data subjects and assumed the Polly-processed audio is gone may be wrong.
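For the part of the flow that is under your control, the output audio in S3, a lifecycle rule can be sketched as below. The helper name `expiry_rule`, the prefix `polly-output/`, and the bucket name in the comment are assumptions for illustration. The dictionary follows the shape that boto3's `put_bucket_lifecycle_configuration` accepts.

```python
def expiry_rule(prefix: str, days: int) -> dict:
    """Build an S3 lifecycle configuration that expires objects under `prefix`.

    expiry_rule is a hypothetical helper; only the returned structure is S3's.
    """
    if days < 1:
        raise ValueError("S3 expiration is expressed in whole days, minimum 1")
    rule_id = "expire-" + prefix.strip("/").replace("/", "-")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},   # only objects written under this prefix
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


config = expiry_rule("polly-output/", days=1)
# Applying it would look like this (bucket name is a placeholder):
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-tts-audio", LifecycleConfiguration=config)
print(config["Rules"][0]["ID"])
```

This covers the output surface only. It says nothing about intermediates inside the synthesis pipeline, and on a versioned bucket noncurrent versions need their own expiration rule.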

Exposure Point 5: Lexicon Storage and Article 5(1)(e) Storage Limitation

Polly lexicons — custom pronunciation dictionaries — are uploaded via PutLexicon and stored in your AWS account. Lexicons can contain domain-specific vocabulary: medical brand names, drug names, patient-accessible clinical terminology, or even proper nouns that uniquely identify patients in context (hospital names, department names, consultant surnames combined with specialty identifiers).

GDPR Article 5(1)(e) requires that personal data is "kept in a form which permits identification of data subjects for no longer than is necessary." Lexicons that embed vocabulary uniquely associated with identifiable individuals — even indirectly — may constitute personal data under Article 4(1). If so, GDPR storage limitation principles require that they are deleted when the purpose for which they were uploaded no longer exists.

In practice, most organisations that use Polly lexicons treat them as persistent configuration — uploaded once, kept indefinitely. This is a reasonable operational choice but requires explicit justification under Article 5(1)(e) and documentation in Records of Processing Activities (RoPA) under Article 30. DPOs reviewing Polly integrations should audit what lexicons exist, what content they contain, and whether retention is justified.
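A lexicon audit of this kind could start from a sketch like the one below, which lists the grapheme entries of a PLS lexicon so a reviewer can look for names or other identifying terms. The sample lexicon and the helper name `lexicon_graphemes` are assumptions for illustration. In a real audit the XML would come from the account's stored lexicons.

```python
import xml.etree.ElementTree as ET

# Namespace of the W3C Pronunciation Lexicon Specification (PLS)
PLS_NS = {"pls": "http://www.w3.org/2005/01/pronunciation-lexicon"}

# Illustrative lexicon content, not taken from any real deployment
SAMPLE = """<lexicon version="1.0"
    xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
    alphabet="ipa" xml:lang="de-DE">
  <lexeme><grapheme>Metformin</grapheme><alias>Met for min</alias></lexeme>
  <lexeme><grapheme>Dr. Schneider</grapheme><alias>Doktor Schneider</alias></lexeme>
</lexicon>"""


def lexicon_graphemes(pls_xml: str) -> list[str]:
    """Return every written form (grapheme) defined in a PLS lexicon."""
    root = ET.fromstring(pls_xml)
    return [
        g.text.strip()
        for g in root.findall("pls:lexeme/pls:grapheme", PLS_NS)
        if g.text
    ]


# With boto3, stored lexicons could be fetched roughly like this:
#   polly = boto3.client("polly")
#   for item in polly.list_lexicons()["Lexicons"]:
#       xml = polly.get_lexicon(Name=item["Name"])["Lexicon"]["Content"]
print(lexicon_graphemes(SAMPLE))
```

Entries such as a consultant's surname are the ones to flag for a retention decision and an entry in the RoPA.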

Exposure Point 6: CLOUD Act and Healthcare Voice Content in Customer-Facing Applications

When European healthcare providers, insurers, or pharmaceutical companies use Polly to build patient-facing voice interfaces — appointment reminders by phone, IVR systems with personalised health information, bedside audio assistants — the voice content delivered to patients passes through AWS synthesis infrastructure for every personalised utterance.

The CLOUD Act risk in this context is not primarily about bulk data extraction. It is about targeted access: a US federal agency seeking information about a specific individual could issue process to AWS compelling production of synthesis records related to that person. In a healthcare context, this could expose diagnosis information, treatment history, and communication patterns — data that would be protected from disclosure under German medical confidentiality law (ärztliche Schweigepflicht) but may not be similarly protected from US compelled disclosure.

The CLOUD Act disclosure and Art.13/14 obligations: GDPR Articles 13 and 14 require that data subjects are informed of data transfers to third countries and the safeguards in place. When a patient receives a spoken medication reminder synthesised by AWS Polly, are they informed that their health data was processed by a US cloud provider subject to CLOUD Act jurisdiction? In most deployments, they are not. This is not a hypothetical compliance gap — it is a structural violation of transparency obligations that DPOs must remediate.

EU-Native Text-to-Speech Alternatives

Coqui TTS

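Coqui TTS originated at a European open-source startup (Berlin, Germany), which wound down operations in 2024. The toolkit's code is licensed under MPL 2.0. The original coqui-ai/TTS repository on GitHub is no longer maintained by the company, and development continues in community forks such as idiap/coqui-ai-TTS. It provides pre-trained models covering more than 1,100 languages, including German, French, Spanish, Dutch, and Polish. Coqui supports XTTS-v2, a high-quality multilingual neural TTS model comparable to commercial offerings. The XTTS-v2 weights are published under the Coqui Public Model License, which restricts commercial use, so check the model licence before a production deployment.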

Self-hosting Coqui on sota.io infrastructure means synthesis runs entirely within EU jurisdiction — no data leaves your deployment. For German healthcare, this eliminates the CLOUD Act exposure entirely. GPU acceleration is supported but not required for most production workloads.

Piper TTS

Piper is a fast, local neural TTS system developed as part of the Rhasspy project. It uses VITS-based neural synthesis with ONNX Runtime for inference, making it suitable for deployment on modest CPU hardware without GPU requirements. Piper is MIT-licensed and supports a wide range of European languages, with voices trained on openly available corpora such as LibriTTS (English) and Thorsten (German).

Piper's latency profile — typically under 200ms for short utterances on modern server hardware — makes it practical for real-time applications like IVR systems or voice assistants. Because inference is local, no text data is transmitted to any external service.
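A local call can be sketched as a thin wrapper around the `piper` command-line tool, which reads text from standard input and writes a WAV file. The helper names and the model filename in the comment are assumptions for this example. The model must be downloaded beforehand and `piper` must be on the PATH.

```python
import subprocess


def piper_command(model_path: str, output_path: str) -> list[str]:
    """Build the argument list for a piper CLI invocation."""
    return ["piper", "--model", model_path, "--output_file", output_path]


def synthesize_locally(text: str, model_path: str, output_path: str) -> None:
    """Run piper as a subprocess; the text is passed on stdin and never leaves the host."""
    subprocess.run(
        piper_command(model_path, output_path),
        input=text.encode("utf-8"),
        check=True,  # raise CalledProcessError if piper exits with an error
    )


# Example call (model filename is a placeholder for a downloaded voice):
#   synthesize_locally("Ihr Termin ist am 15. Mai.",
#                      "de_DE-thorsten-medium.onnx", "reminder.wav")
print(piper_command("de_DE-thorsten-medium.onnx", "reminder.wav"))
```

For latency-sensitive services, keeping the model loaded in a long-running process avoids paying the start-up cost on every utterance, which a per-call subprocess does not.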

MaryTTS (DFKI)

MaryTTS is a Java-based open-source TTS platform developed by the Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI), a German AI research institution. It is LGPL-licensed and has been under development for more than 20 years, with production deployments in European healthcare and academic contexts.

MaryTTS supports German, English, French, Italian, and Turkish, with a research-grade voice building toolkit that allows organisations to train custom voices on their own voice corpora. Because DFKI is a German institution and MaryTTS is self-hosted, there is no US jurisdiction exposure.

Festival Speech Synthesis System

Festival is an open-source TTS framework developed at the Centre for Speech Technology Research at the University of Edinburgh (EU academic institution prior to Brexit, Scottish legal jurisdiction). It is widely available in Linux package repositories and supports a plugin architecture for voice modules. Festival is less suitable for high-quality production synthesis than Coqui or Piper but remains an option for batch document-to-audio workloads where synthesis quality is secondary to compliance.

Mozilla TTS (Abandoned Upstream, Community Fork Active)

Mozilla TTS was Mozilla Corporation's open-source TTS project. Mozilla discontinued active development, but the codebase — which formed the basis for Coqui TTS — remains available and is maintained by the community. For new deployments, Coqui TTS (the direct successor) is preferred. Mozilla TTS is mentioned here because many existing European deployments reference it by name.

Mimic3 (Mycroft AI)

Mimic3 is a neural TTS engine from Mycroft AI, AGPL-licensed and designed for offline, privacy-preserving synthesis. It supports 200+ voices across 30+ languages and is optimised for low-latency on-device inference. Mycroft AI is a US company but Mimic3 is a fully self-hosted system — the synthesis model weights are downloaded once and inference runs locally, so no data is transmitted to Mycroft or any US service.

Deploying EU-Native TTS on sota.io

All of the above alternatives are containerisable and deployable on EU-sovereign infrastructure. sota.io provides managed deployment of containerised workloads on European servers without US parent-company control chains. A typical Coqui TTS or Piper deployment on sota.io consists of:

A container image built from the official model (e.g., ghcr.io/coqui-ai/tts or a Piper-based Docker image)
A synthesis API service (Flask, FastAPI, or native REST endpoints)
An HTTPS endpoint that accepts text and returns audio — functionally equivalent to the Polly SynthesizeSpeech API
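The three components above can be sketched with the standard library alone. In the example below, `placeholder_synthesize` returns silence and stands in for a real Coqui or Piper call. It, `make_app`, and the JSON request shape are assumptions for illustration, and TLS termination is assumed to happen in a proxy in front of the process.

```python
import io
import json
import wave


def placeholder_synthesize(text: str) -> bytes:
    """Stand-in for a real engine: silent 16 kHz mono WAV, 50 ms per character."""
    frames = int(16000 * 0.05 * len(text))
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)          # 16-bit samples
        wav.setframerate(16000)
        wav.writeframes(b"\x00\x00" * frames)
    return buf.getvalue()


def make_app(synthesize=placeholder_synthesize):
    """Return a WSGI app that maps POST {"text": "..."} to an audio/wav response."""
    def app(environ, start_response):
        if environ.get("REQUEST_METHOD") != "POST":
            start_response("405 Method Not Allowed", [("Allow", "POST")])
            return [b""]
        length = int(environ.get("CONTENT_LENGTH") or 0)
        try:
            text = json.loads(environ["wsgi.input"].read(length))["text"]
            if not isinstance(text, str):
                raise TypeError("text must be a string")
        except (ValueError, KeyError, TypeError):
            start_response("400 Bad Request", [("Content-Type", "text/plain")])
            return [b"expected a JSON body with a string 'text' field"]
        audio = synthesize(text)     # runs inside this process, on this host
        start_response("200 OK", [("Content-Type", "audio/wav"),
                                  ("Content-Length", str(len(audio)))])
        return [audio]
    return app


app = make_app()
# Local try-out:
#   from wsgiref.simple_server import make_server
#   make_server("127.0.0.1", 8080, app).serve_forever()
```

Passing the synthesiser in as a parameter keeps the HTTP layer independent of the engine, so a Coqui or Piper backend can be swapped in without touching the endpoint.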

The resulting stack is architecturally identical to a Polly integration but with synthesis occurring on EU-sovereign infrastructure under GDPR-compliant jurisdiction. There is no CLOUD Act exposure. There is no US data transfer to disclose under Articles 13/14. And there is no third-party retention of synthesis intermediates that complicates Article 17 erasure requests.

For healthcare and regulated-sector deployments specifically, a self-hosted TTS stack is the only architecture that satisfies the combination of Art.9 obligations, Art.17 erasure rights, Art.13/14 transparency requirements, and Art.25 privacy-by-design principles — without relying on Standard Contractual Clauses that do not resolve CLOUD Act jurisdiction.

Summary

Risk	AWS Polly	EU Self-Hosted (Coqui / Piper / MaryTTS)
Art.9 medical text processing	US jurisdiction	EU jurisdiction
Custom voice = biometric data (Art.9)	AWS-controlled model	Self-controlled model
SSML personal data in API calls	Transmitted to AWS	Processed locally
S3 async synthesis Art.17 erasure gap	Partial guarantee	Full control
Lexicon persistence Art.5(1)(e)	Stored in AWS account	Local configuration
CLOUD Act compelled disclosure	Legally exposed	Not applicable
Art.13/14 disclosure obligation	Required, often missing	No third-country transfer

AWS Polly is an excellent product for teams that can accept US-jurisdiction data processing. For European organisations in healthcare, legal, financial services, or any regulated sector where special category data appears in synthesised text, the jurisdiction analysis makes Polly structurally incompatible with GDPR obligations — regardless of which AWS region is selected.

The EU-native alternatives — particularly Coqui TTS and Piper — have reached production quality sufficient for the majority of voice synthesis workloads. Deployed on EU-sovereign infrastructure, they provide equivalent functionality without the compliance exposure.

EU-Native Hosting

Ready to move to EU-sovereign infrastructure?

sota.io is a German-hosted PaaS — no CLOUD Act exposure, no US jurisdiction, full GDPR compliance by design. Deploy your first app in minutes.
