2026-05-01 · 13 min read

AWS Transcribe EU Alternative 2026: Voice Data, GDPR Article 9, and the CLOUD Act Problem

Post #741 in the sota.io EU Compliance Series

AWS Transcribe is Amazon's managed automatic speech recognition (ASR) service. Contact centers use it to transcribe customer calls and power quality assurance workflows. Healthcare providers use Transcribe Medical to convert clinical dictations into structured notes. Legal teams use it to transcribe proceedings, depositions, and recorded interviews. Developers use it to add voice interfaces to applications.

The service handles two distinct categories of audio: general transcription for business audio, and Transcribe Medical for healthcare-specific content using specialized clinical vocabulary and models. Both share the same infrastructure: Amazon Web Services, Inc., a Delaware corporation operating under US jurisdiction.

AWS runs Transcribe in European regions: eu-west-1 (Ireland), eu-central-1 (Frankfurt), eu-west-3 (Paris). Audio files processed through Transcribe typically reside in S3 buckets in those same European regions. Teams handling GDPR-regulated audio often treat this as compliant configuration.

It is not. The CLOUD Act (18 U.S.C. § 2713) compels US-incorporated companies to produce data stored anywhere in the world when ordered by US authorities. A valid government order served on Amazon in Seattle can reach your transcription jobs, audio files, custom language models, and Call Analytics outputs in Frankfurt. For speech data — which GDPR treats as biometric data when used for identification — this CLOUD Act exposure carries consequences that standard cloud services do not.

Voice Data Under GDPR: Why Transcription Is Different

Most cloud services process data that falls clearly into ordinary GDPR categories. Voice is different.

GDPR Art.4(14) defines biometric data as "personal data resulting from specific technical processing relating to the physical, physiological or behavioural characteristics of a natural person, which allow or confirm the unique identification of that natural person." Voice characteristics satisfy this definition: pitch patterns, cadence, linguistic habits, and acoustic fingerprints are person-specific physical characteristics that can be used for identification.

This does not mean that every audio recording is automatically biometric data under Art.9. GDPR Recital 51 clarifies that biometric processing only triggers special-category protection "when processed through a specific technical means allowing the unique identification or authentication of a natural person." The classification depends on what you do with the audio, not just the fact of recording.

AWS Transcribe does exactly the processing that triggers biometric classification for some use cases. Speaker diarization — Transcribe's feature for labeling which speaker said what — performs acoustic analysis to distinguish speakers. When diarization identifies Speaker 1 and Speaker 2 across a call recording, it is performing technical processing of physical vocal characteristics. If those speaker labels are subsequently linked to known individuals (which call center deployments almost always do), the acoustic analysis has performed biometric identification.

Transcribe Medical introduces a second special-category dimension: health data under Art.9(1). Clinical dictations, patient consultations, therapy sessions, and diagnostic discussions are health data by definition. When a healthcare organization processes them through AWS Transcribe Medical, it is transferring special-category health data to a US-jurisdiction processor.

The combination — voice biometrics (Art.9 biometric) and health content (Art.9 health data) — makes medical transcription use cases among the most legally constrained in enterprise cloud computing.

What AWS Transcribe Stores About Your Audio

Understanding the GDPR exposure requires mapping what Transcribe retains beyond the transcription itself.

Transcription Job Records

Every Transcribe API call creates a transcription job with associated metadata: job name, status, language code, media file URI, media format, output S3 bucket, creation timestamp, completion timestamp, and configuration parameters. These job records persist in Transcribe's internal state independently of whether you retain the input audio or output transcript.

The transcription job record does not contain the audio content itself, but it contains the S3 URI of the audio file and the location of the output. It is a durable pointer to where personal data was processed — and in healthcare contexts, a record that a specific clinical interaction was processed at a specific time.

Output Transcripts: Structured Personal Data at Scale

Transcribe outputs are not simple text files. A Transcribe JSON output includes the full transcript text; word-level items, each carrying a start timestamp, an end timestamp, a confidence score, and the recognized token; punctuation items interleaved with the words; and speaker labels for each segment when diarization is enabled.

The word-level timestamps create a behavioral fingerprint: the exact timing of speech, pauses, hesitations, and interruptions in a conversation. For a one-hour call center interaction, this is several hundred kilobytes of structured data about the conversation dynamics — data that was not explicitly captured in the original audio request but is generated and stored by Transcribe.
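
What that looks like in practice, reading a downloaded batch output (field names follow Transcribe's documented JSON layout; the file path is a placeholder):

import json

# A completed batch job's output JSON, downloaded from the output bucket.
with open("call-1234.json") as f:
    output = json.load(f)

print(output["results"]["transcripts"][0]["transcript"])

# Every recognized word carries its own timing and confidence score;
# punctuation marks appear as separate items without timestamps.
for item in output["results"]["items"]:
    if item["type"] == "pronunciation":
        word = item["alternatives"][0]
        print(item["start_time"], item["end_time"], word["content"], word["confidence"])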

Custom Vocabularies and Custom Language Models

Transcribe allows teams to upload custom vocabularies — lists of domain-specific words, proper nouns, brand names, and technical terms that the base ASR models might misrecognize. In healthcare deployments, custom vocabularies typically contain drug names, procedure codes, physician names, and facility identifiers.

Custom vocabularies are stored persistently in the Transcribe service. They are not session-scoped. A custom vocabulary uploaded on day one remains in the Transcribe service until explicitly deleted. In GDPR terms, this is a long-term retention of domain-specific identifiers under a US-jurisdiction processor with no documented retention limit.
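
Auditing what has accumulated is one paginated API loop. A boto3 sketch; the deletion call is commented out because removal is a policy decision:

import boto3

transcribe = boto3.client("transcribe", region_name="eu-central-1")

# Page through every custom vocabulary in the account; each one persists
# until an explicit delete call.
resp = transcribe.list_vocabularies(MaxResults=100)
while True:
    for vocab in resp["Vocabularies"]:
        print(vocab["VocabularyName"], vocab["LanguageCode"], vocab["LastModifiedTime"])
        # Removal is a single call once a vocabulary is no longer needed:
        # transcribe.delete_vocabulary(VocabularyName=vocab["VocabularyName"])
    token = resp.get("NextToken")
    if not token:
        break
    resp = transcribe.list_vocabularies(MaxResults=100, NextToken=token)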

Custom Language Models (CLMs) go further. A CLM is a custom ASR model trained on your organization's audio and text data. You provide training data — potentially hours of recorded conversations, internal documents, domain-specific corpora — and Transcribe trains a specialized model. This trained model is stored in Transcribe and linked to your AWS account.

The GDPR Art.17 problem with custom language models parallels the problem documented for AWS SageMaker and AWS Personalize: a trained model encodes statistical patterns from its training data. Deleting the training data does not delete the model. Deleting the model requires explicit action. If the training data included personal data — names, voice recordings, clinical content — that data's influence persists in the model weights after the source files are deleted.

Transcribe Call Analytics: Inference Beyond Transcription

AWS Transcribe Call Analytics is a separate product layer that processes customer service call recordings and adds automated analysis: sentiment detection, interruption counts, non-talk time measurement, loudness tracking, issue detection, and call categorization.

This analysis transforms audio recordings into behavioral and emotional inferences about individual callers: their emotional state during the call, whether they were interrupted, how long the agent kept them waiting. These inferences are generated and stored without separate opt-in for the inference step. A team that enables Call Analytics to improve contact center quality is implicitly authorizing AWS to generate sentiment profiles of every customer call.

Under GDPR Art.5(1)(b), the purpose of data collection should be "specified, explicit and legitimate." Collecting audio for transcription purposes and then enabling sentiment inference on that audio represents purpose extension — the secondary analytical processing may not have been disclosed to call participants in the original consent or privacy notice.

Six GDPR Exposure Points

1. Article 9 — Special Category Processing Without Explicit Consent

Voice recordings processed through speaker diarization may constitute biometric data under Art.9. Transcribe Medical processing is health data under Art.9. Both categories require either explicit consent or another Art.9(2) basis (employment law, vital interests, public health, etc.). The standard Art.6(1)(b) contractual necessity or Art.6(1)(f) legitimate interests grounds for processing are insufficient — Art.9 requires the higher threshold.

Many deployments use Transcribe for content where Art.9 applies (therapy sessions, HR disciplinary proceedings, legal depositions, medical consultations) but rely on general Art.6 lawful bases that do not satisfy Art.9 requirements. This is a structural compliance gap.

2. CLOUD Act — Real-Time Streaming Transcription

AWS Transcribe supports two processing modes: batch transcription (submit an audio file, receive results asynchronously) and streaming transcription (send audio chunks in real-time over WebSocket or HTTP/2, receive transcript chunks in real-time).

Streaming transcription creates a CLOUD Act exposure window that is qualitatively different from batch processing. When you stream audio to Transcribe in real-time — a live customer call, a physician dictation during a patient visit, a legal deposition — the audio leaves EU jurisdiction at the moment of transmission. There is no buffer, no review window, no opportunity to assess CLOUD Act risk before the data transfers.

For batch processing, you could in theory review each audio file before submission. For streaming transcription integrated into a contact center or live transcription tool, the transmission is architecturally instantaneous and continuous. The CLOUD Act risk materializes with every audio chunk.

Transcribe streaming endpoints are operated by AWS itself, not by infrastructure under your control. Even when routing to eu-central-1, the receiving endpoint is run by the US-incorporated AWS entity. Under current CLOUD Act interpretation, real-time audio processing by AWS constitutes processing by a US entity regardless of which AWS region hosts the API endpoint.

3. Article 17 — Erasure Incompleteness

GDPR Art.17 grants data subjects the right to erasure — the "right to be forgotten." For an audio recording processed through Transcribe, full erasure requires:

  1. Deleting the source audio file from S3
  2. Deleting the Transcribe transcription job and its output
  3. Deleting Call Analytics jobs and their outputs (if used)
  4. Deleting custom vocabularies if they contain personal identifiers
  5. Deleting or retraining any Custom Language Models trained on data that includes the individual

Steps 1-4 are straightforward API calls. Step 5 is not. A CLM is a trained model that encodes statistical patterns from its training corpus. If that corpus included audio from a data subject who subsequently requests erasure, the model weights reflect patterns learned from their speech. AWS does not provide a mechanism to selectively remove individual data subjects' influence from a trained CLM. The only complete erasure is deleting the model entirely and retraining without the requesting individual's data.
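
Steps 1 through 4 look like this with boto3; every bucket, key, and resource name below is a hypothetical placeholder:

import boto3

s3 = boto3.client("s3", region_name="eu-central-1")
transcribe = boto3.client("transcribe", region_name="eu-central-1")

# Step 1: delete the source audio object.
s3.delete_object(Bucket="call-audio-eu", Key="calls/2026/call-1234.wav")

# Step 2: delete the transcription job record; the transcript in the
# S3 output bucket is a separate object and must be deleted too.
transcribe.delete_transcription_job(TranscriptionJobName="call-1234")
s3.delete_object(Bucket="transcripts-eu", Key="call-1234.json")

# Step 3: delete the Call Analytics job, if one was run.
transcribe.delete_call_analytics_job(CallAnalyticsJobName="call-1234-analytics")

# Step 4: delete custom vocabularies containing personal identifiers.
transcribe.delete_vocabulary(VocabularyName="agent-names-vocab")

# Step 5 has no selective API: a Custom Language Model can only be
# deleted wholesale (delete_language_model) and retrained without the
# data subject's audio.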

For medical deployments using Transcribe Medical with CLMs trained on patient audio, a single Art.17 request from a former patient can require retraining a custom model from scratch — a compute-intensive, time-consuming operation that may not be achievable within the Art.17(1) "without undue delay" requirement.

4. Article 5(1)(e) — Storage Limitation

Transcribe transcription jobs and their outputs accumulate indefinitely unless explicitly deleted. There is no configurable retention period within the Transcribe service itself. The S3 output buckets can be configured with lifecycle rules, but the Transcribe job records persist in Transcribe's internal state regardless.

For high-volume contact center deployments transcribing thousands of calls per day, uncleaned Transcribe job records accumulate rapidly. Each job record links an audio file URI (potentially containing personal data) to a transcript output location. This is a durable index of personal data processing events with no automatic expiry.
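
A sketch of the S3 side with boto3, assuming a hypothetical transcripts-eu output bucket. Note the asymmetry: this expires the transcript objects, but the Transcribe job records still require explicit delete_transcription_job calls:

import boto3

s3 = boto3.client("s3", region_name="eu-central-1")

# Expire transcript objects 30 days after creation. This covers only
# the S3 layer; Transcribe's internal job records have no equivalent
# retention setting.
s3.put_bucket_lifecycle_configuration(
    Bucket="transcripts-eu",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-transcripts",
                "Filter": {"Prefix": "transcripts/"},
                "Status": "Enabled",
                "Expiration": {"Days": 30},
            }
        ]
    },
)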

5. Article 5(1)(b) — Purpose Limitation in Call Analytics

Call Analytics generates inferences that were not part of the stated purpose of recording. If a call center's privacy notice states that calls are recorded "for quality assurance and training," and Call Analytics generates sentiment scores, loudness measurements, and behavioral categories from those calls, those inferences may exceed the stated purpose.

Sentiment classification of customer calls creates a behavioral and emotional profile of customers — data that could be used for customer segmentation, pricing decisions, or service tiering. If that downstream use was not disclosed at the time of recording, the processing violates Art.5(1)(b).

6. Article 25 — Privacy by Design Defaults

Transcribe does not apply PII redaction by default. PII entity detection and redaction are opt-in settings. Deployments that omit them generate and store transcripts containing names, phone numbers, email addresses, account numbers, and other identifiers in full text.
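
Enabling redaction is a per-job, opt-in setting. A boto3 sketch with placeholder names; note that AWS offers PII redaction only for a limited set of languages (primarily English locales), which itself limits its usefulness for EU audio:

import boto3

transcribe = boto3.client("transcribe", region_name="eu-central-1")

# Redaction must be requested explicitly on every job; omitting the
# ContentRedaction block yields full-text transcripts. Names and URIs
# below are placeholders.
transcribe.start_transcription_job(
    TranscriptionJobName="call-1234-redacted",
    Media={"MediaFileUri": "s3://call-audio-eu/calls/call-1234.wav"},
    LanguageCode="en-US",
    OutputBucketName="transcripts-eu",
    ContentRedaction={
        "RedactionType": "PII",
        "RedactionOutput": "redacted",
        "PiiEntityTypes": ["ALL"],
    },
)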

Transcribe Medical has no PII redaction capability at all. The service documentation explicitly states that Transcribe Medical does not support PII identification or redaction. Health data transcriptions are output in full, with no automated masking of patient names, provider names, or other identifiers.

Art.25(2) requires that "by default" only personal data necessary for the specific purpose is processed. Transcribing audio without redacting identifiers that are not needed for the downstream use case — a common configuration — is a default that fails the Art.25(2) data minimisation requirement.

The CLOUD Act Across Transcribe's Three Data Layers

Transcribe creates CLOUD Act exposure across three distinct data layers:

Layer 1 — Audio Storage. Source audio files in S3 remain under AWS (US) jurisdiction regardless of which AWS region hosts the bucket. The audio file is the most sensitive layer: it contains the speaker's voice, which may be biometric data, and the content of what was said.

Layer 2 — Transcription Artifacts. Transcription job records, output transcript files, Call Analytics outputs, and custom vocabularies are stored in Transcribe's internal state and the designated S3 output bucket. These are under the same CLOUD Act jurisdiction as the source audio.

Layer 3 — Trained Models. Custom Language Models trained on your data are stored in Transcribe and linked to your AWS account. A CLOUD Act order compelling production of model artifacts would expose the statistical patterns learned from potentially thousands of hours of audio from data subjects across your organization.

For general business use cases — transcribing sales calls or product feedback interviews — this three-layer exposure is a material compliance risk. For healthcare, legal, HR, or public sector use cases, it is a significant legal liability.

EU Alternatives: Self-Hosted and EU-Sovereign Options

The most robust EU-compliant alternative for speech-to-text is self-hosted open source ASR. The open source speech recognition ecosystem has matured significantly since 2020, and production-grade options exist for general and medical transcription.

OpenAI Whisper (Self-Hosted)

OpenAI Whisper is an open source automatic speech recognition model released in September 2022. It supports 99 languages (accuracy is strongest for high-resource European languages), operates without an internet connection when deployed locally, and produces transcripts with word-level timestamps. The model weights are publicly available.

Running Whisper on EU infrastructure — a VPS in Frankfurt, a dedicated GPU server, or a container deployment on sota.io — means that audio files never leave EU jurisdiction. The model runs on your hardware; no audio is transmitted to a third party.

Whisper is available in five sizes: tiny, base, small, medium, and large; large-v3 is the current large checkpoint. The large-v3 model achieves word error rates competitive with AWS Transcribe for major European languages. For GPU-accelerated inference, a single NVIDIA A10G transcribes audio many times faster than real time. CPU-only inference is practical for small volumes with the small or base models.

Whisper.cpp is a C++ port of Whisper with significantly reduced memory footprint. It runs on CPU without GPU requirements, making it suitable for edge deployments or lower-cost EU cloud instances.

For the API-compatible deployment pattern, faster-whisper provides a Python library with significantly improved throughput over the original Whisper inference code. It can be wrapped in a REST API and deployed behind an nginx reverse proxy to expose an endpoint structurally similar to the AWS Transcribe batch API.

Vosk

Vosk is an offline speech recognition toolkit with a smaller model footprint than Whisper. Pre-trained models are available for over 20 languages, including German, French, Spanish, Italian, Portuguese, Polish, and Dutch — all common enterprise languages in the EU.

Vosk models range from 40MB to 1.8GB. The lighter models are suitable for embedded or resource-constrained deployments: edge devices, IoT, or applications where loading a 3GB Whisper large-v3 model is impractical.

Vosk uses the Kaldi ASR toolkit as its backend. It supports both batch and real-time streaming transcription via WebSocket, making it a functional equivalent for Transcribe's streaming API.
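
A minimal sketch of Vosk's incremental recognition loop; the model directory is a placeholder for a model downloaded from the Vosk project, and the input is assumed to be 16 kHz mono PCM WAV:

import wave
from vosk import Model, KaldiRecognizer

# Model directory downloaded from the Vosk model repository (placeholder).
model = Model("models/vosk-model-de-0.21")

wf = wave.open("call.wav", "rb")  # assumed 16 kHz mono PCM
rec = KaldiRecognizer(model, wf.getframerate())

# Feed audio in small chunks, exactly as a streaming source would arrive.
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        print(rec.Result())         # finalized segment (JSON string)
    else:
        print(rec.PartialResult())  # interim hypothesis

print(rec.FinalResult())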

Kaldi

Kaldi is the research-grade ASR framework that underlies much of the commercial ASR industry. It is not a user-friendly deployment option, but it represents the EU alternative with the most capability for organizations with specialized audio (courtroom transcription, medical dictation, domain-specific vocabulary).

Fine-tuning Kaldi models on domain-specific audio achieves accuracy levels comparable to AWS Transcribe Medical for clinical vocabulary. Organizations with existing labeled audio corpora can train EU-sovereign medical ASR models using Kaldi without dependency on any US cloud service.

NVIDIA NeMo

NVIDIA NeMo is an enterprise-grade conversational AI toolkit that includes ASR, text-to-speech, and natural language processing components. NeMo supports training and deploying ASR models on EU infrastructure, including multi-GPU configurations for high-throughput production transcription.

NeMo supports CTC and attention-based encoder-decoder architectures, streaming transcription via Riva (NVIDIA's production ASR serving platform), and speaker diarization. For organizations that need the feature parity of AWS Transcribe Call Analytics — speaker separation, real-time streaming, sentiment analysis — NeMo with EU-hosted GPU infrastructure is the closest self-sovereign equivalent.
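
A minimal sketch, assuming the nemo_toolkit ASR package is installed and that NVIDIA's published German Conformer checkpoint (stt_de_conformer_ctc_large) fits the use case; the transcribe call signature has varied across NeMo releases, so treat this as illustrative rather than definitive:

import nemo.collections.asr as nemo_asr

# Download a pretrained Conformer-CTC checkpoint from NGC; swap in the
# model for your target language (name assumed from NVIDIA's catalog).
asr_model = nemo_asr.models.ASRModel.from_pretrained("stt_de_conformer_ctc_large")

# Batch transcription of local files on EU-hosted GPUs; nothing leaves
# the host at inference time.
transcripts = asr_model.transcribe(["call-1234.wav"])
print(transcripts[0])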

For Medical Transcription

Medical ASR is the hardest EU alternative case because specialized medical vocabulary requires either a pre-trained medical model or fine-tuning on clinical audio. Options:

Fine-tuned Whisper for clinical audio: The Whisper architecture can be fine-tuned on clinical audio corpora. Open source clinical audio datasets are available (MedSpeech, clinical trial recordings). A fine-tuned Whisper medical model deployed on EU infrastructure achieves Transcribe Medical-level accuracy for major specialties.

Custom vocabulary injection: The self-hosted options above all support domain vocabulary, though the mechanisms differ: Kaldi and Vosk accept custom lexicons, while Whisper is biased toward domain terms through its initial_prompt parameter (see the sketch after these options). Keeping a medical vocabulary (ICD-10 codes, drug names, procedure terminology) on EU infrastructure, rather than uploading it to a US-jurisdiction service, eliminates the custom vocabulary GDPR risk documented above.

Nuance DAX / Dragon Medical: Nuance (now part of Microsoft) offers Dragon Medical One as a hosted medical transcription service. Note that Microsoft is also a US corporation subject to the CLOUD Act. Dragon Medical Server is available as an on-premise deployment that eliminates the CLOUD Act dimension, but requires significant infrastructure investment.
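
For Whisper, the vocabulary injection mentioned above works through prompt biasing rather than a vocabulary file: domain terms passed via initial_prompt steer decoding toward that terminology. A minimal sketch with an illustrative term list:

import whisper

model = whisper.load_model("medium")

# Whisper has no vocabulary-file mechanism; instead, domain terms can be
# supplied as a text prompt that biases decoding toward that vocabulary.
medical_terms = "metformin, atorvastatin, echocardiogram, myocardial infarction"

result = model.transcribe(
    "dictation.wav",
    language="en",
    initial_prompt=f"Clinical dictation. Vocabulary: {medical_terms}.",
)
print(result["text"])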

Deploying Whisper on EU Infrastructure

The practical self-hosted alternative for most organizations is Whisper with a REST API wrapper. The deployment architecture:

Audio File (EU S3 / NFS)
    ↓
Whisper API Container (EU VPS / sota.io)
    ↓
Transcript JSON (EU database / S3)

A minimal Whisper API using FastAPI:

from fastapi import FastAPI, UploadFile
import whisper
import tempfile
import os

app = FastAPI()

# Load the model once at startup; large-v3 needs roughly 10 GB of GPU
# memory, while the smaller checkpoints run on CPU.
model = whisper.load_model("large-v3")

@app.post("/transcribe")
async def transcribe(file: UploadFile):
    # Buffer the upload to a temporary file; Whisper decodes via ffmpeg,
    # so any common audio container works.
    with tempfile.NamedTemporaryFile(delete=False, suffix=".audio") as tmp:
        tmp.write(await file.read())
        tmp_path = tmp.name
    try:
        # Inference runs entirely on this host; word_timestamps=True adds
        # per-word timing to every segment.
        result = model.transcribe(tmp_path, word_timestamps=True)
        return {"transcript": result["text"], "segments": result["segments"]}
    finally:
        # Delete the buffered audio as soon as the response is built.
        os.unlink(tmp_path)

This API accepts audio files, runs Whisper inference locally, and returns transcripts with segment timestamps. No audio leaves the EU server. Deployable as a Docker container on any EU VPS.
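
Calling the endpoint from application code is a single multipart POST. A sketch using the requests library, with a placeholder host and filename:

import requests

# Send an audio file to the self-hosted endpoint; nothing leaves the EU host.
with open("call-1234.wav", "rb") as audio:
    resp = requests.post(
        "https://transcribe.internal.example.eu/transcribe",
        files={"file": audio},
    )

resp.raise_for_status()
print(resp.json()["transcript"])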

For near-real-time transcription, faster-whisper with VAD (voice activity detection) yields segments as decoding progresses:

from faster_whisper import WhisperModel

# CTranslate2 backend; float16 on GPU provides the throughput gain
# described above.
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

def transcribe_stream(audio_source):
    # audio_source must be a file path, file-like object, or NumPy array
    # (not a generator of raw chunks); segments are yielded lazily, so
    # output streams while decoding is still running. vad_filter skips
    # silence and non-speech regions.
    segments, info = model.transcribe(audio_source, beam_size=5, vad_filter=True)
    for segment in segments:
        yield {"start": segment.start, "end": segment.end, "text": segment.text}

The key deployment requirement: EU-resident infrastructure only. sota.io deploys containers in Frankfurt by default. A Whisper container deployment on sota.io means audio processing stays within the EU, with no CLOUD Act exposure on the transcription layer.

Migration Path from AWS Transcribe

For teams moving existing Transcribe workloads to EU-sovereign alternatives:

1. Audit existing transcription jobs. Use the Transcribe list-transcription-jobs API to enumerate all existing jobs, and document retention periods and business justification for each job type. A sketch covering this step and step 3 appears after these steps.

2. Export and delete transcription outputs. Pull transcript JSONs from S3 output buckets, convert to your storage format, then delete the S3 objects and the Transcribe job records.

3. Migrate custom vocabularies. Convert Transcribe custom vocabulary files (CSV/TSV with Phrase and SoundsLike columns) into a plain term list for Whisper prompt biasing or into Kaldi lexicon format (covered in the sketch after these steps). Delete the Transcribe custom vocabulary resources.

4. Delete Custom Language Models. If you have trained CLMs on Transcribe, document what training data was used, then delete the CLMs. Assess whether any of the training data constitutes personal data that triggered Art.17 obligations.

5. Deploy EU-sovereign ASR. Configure your Whisper API deployment on EU infrastructure. Update application endpoints from Transcribe API calls to your self-hosted API.

6. Update Call Analytics pipelines. AWS Transcribe Call Analytics outputs require migration to equivalent EU-sovereign sentiment analysis tools. Open source options: Hugging Face sentiment models (multilingual, self-hosted), VADER (English), or custom classifiers trained on your domain.

7. Verify CLOUD Act elimination. Confirm that no audio files are transmitted to AWS endpoints. The architectural test: run tcpdump or inspect VPC flow logs during a transcription job; no traffic should reach AWS IP ranges.
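
A sketch of steps 1 and 3 with boto3, assuming vocabulary tables in the standard tab-separated Phrase/SoundsLike/IPA/DisplayAs layout; paths and names are placeholders:

import csv
import boto3

transcribe = boto3.client("transcribe", region_name="eu-central-1")

# Step 1: enumerate every transcription job for the retention audit.
jobs = []
resp = transcribe.list_transcription_jobs(MaxResults=100)
while True:
    jobs.extend(resp["TranscriptionJobSummaries"])
    token = resp.get("NextToken")
    if not token:
        break
    resp = transcribe.list_transcription_jobs(MaxResults=100, NextToken=token)

for job in jobs:
    print(job["TranscriptionJobName"], job["CreationTime"], job["TranscriptionJobStatus"])

# Step 3: extract the Phrase column from an exported vocabulary table
# into a plain term list usable as a Whisper prompt or a Kaldi lexicon seed.
with open("transcribe-vocabulary.tsv", newline="") as f:
    phrases = [row["Phrase"] for row in csv.DictReader(f, delimiter="\t")]

with open("domain-terms.txt", "w") as out:
    out.write("\n".join(phrases))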

GDPR Compliance Checklist for Speech-to-Text

1. Classify the audio. Does diarization link speakers to known individuals (biometric data), or does the content include health information? If so, document an Art.9(2) basis, not just an Art.6 one.

2. Redact by default. Enable PII redaction wherever the downstream use does not require identifiers; treat unredacted transcripts as the exception, per Art.25(2).

3. Enforce retention limits. Set and verify expiry for audio files, transcripts, and transcription job records; S3 lifecycle rules do not cover job records.

4. Document erasure end to end. An Art.17 procedure must cover source audio, job records, analytics outputs, custom vocabularies, and trained models.

5. Match notices to inferences. If sentiment scores or behavioral categories are generated, the privacy notice must say so, per Art.5(1)(b).

6. Verify jurisdiction. For CLOUD Act elimination, no audio may transit infrastructure operated by a US-incorporated entity, in any region.

Conclusion

AWS Transcribe is a capable and well-documented ASR service. Its GDPR exposure is not a product flaw — it is a structural consequence of being operated by a US-incorporated entity. The CLOUD Act reaches AWS's transcription infrastructure regardless of which region processes your audio.

For voice data, that CLOUD Act reach matters more than for many other AWS services. Voice recordings are potentially biometric under Art.9. Medical transcriptions are health data under Art.9. Speaker diarization creates identification-enabling analysis of physical characteristics. All of it is special-category data that requires a heightened legal basis and heightened protection.

Self-hosted Whisper on EU infrastructure — deployed via container on sota.io or any EU VPS — provides equivalent transcription accuracy with zero CLOUD Act exposure. Audio never leaves EU jurisdiction. Custom vocabularies stay on EU servers. Transcription job artifacts are stored in EU databases under EU law.

For teams processing high-sensitivity audio — healthcare, legal, HR, contact centers — the migration from AWS Transcribe to EU-sovereign ASR is a straightforward technical project with significant compliance upside. The open source ASR ecosystem in 2026 is mature enough that it no longer requires trading accuracy for sovereignty.

EU-Native Hosting

Ready to move to EU-sovereign infrastructure?

sota.io is a German-hosted PaaS — no CLOUD Act exposure, no US jurisdiction, full GDPR compliance by design. Deploy your first app in minutes.