AWS Comprehend EU Alternative 2026: NLP, PII Detection, and the GDPR Problem
Post #733 in the sota.io EU Compliance Series
AWS Comprehend is Amazon's managed natural language processing service. It extracts named entities, detects PII, performs sentiment analysis, classifies documents, identifies key phrases, and — through ComprehendMedical — extracts structured medical information from clinical text. For development teams that need to analyze text at scale without building NLP pipelines from scratch, Comprehend is the obvious starting point: no ML expertise required, pay-per-call pricing, and tight integration with the rest of the AWS ecosystem.
That convenience introduces GDPR exposure that is not immediately obvious from the API documentation. Comprehend does not just analyze text — it creates derivative records, builds behavioral profiles, logs PII detection results to CloudWatch, and in the medical variant, processes Article 9 health data in US-jurisdiction infrastructure. Each of these creates a distinct GDPR problem that requires either a documented legal basis, a valid transfer mechanism, or an architectural change to remain compliant.
Amazon Web Services, Inc. is a Delaware corporation headquartered in Seattle, Washington. The CLOUD Act (18 U.S.C. § 2713) authorizes US law enforcement and intelligence agencies to compel production of data held by US companies, regardless of where that data is stored. When the text being analyzed contains health information, political opinions, religious content, or behavioral signals — all of which Comprehend can extract and log — CLOUD Act reach extends to that analysis output.
This analysis covers six GDPR exposure points in AWS Comprehend that European development teams need to understand before deploying any NLP pipeline in production.
ComprehendMedical and Article 9 Health Data
AWS Comprehend Medical is a specialized service that extracts medical entities from unstructured clinical text: diagnoses, medications, dosages, anatomical references, test results, medical procedures, and temporal relationships between clinical events. It is designed for processing clinical notes, discharge summaries, doctor's letters, and EHR data that arrives as free text rather than structured fields.
The GDPR classification of this data is unambiguous. Article 4(15) defines "data concerning health", and Article 9(1) lists it among the special categories of personal data whose processing is prohibited unless one of the Article 9(2) exceptions applies. EDPB guidelines have consistently interpreted the category broadly: not just formal diagnoses, but any data from which health status can be inferred counts as Article 9 data. ComprehendMedical's output (structured JSON with extracted medications, conditions, and clinical entities) is health data by definition.
The CLOUD Act problem is acute here. Clinical text containing patient diagnoses, medications, and treatment histories is processed by an AWS service running on US infrastructure operated by a US company. Under the CLOUD Act, US authorities can compel AWS to produce this data regardless of whether it is stored in an AWS EU region. Article 9(2) requires explicit consent or another enumerated exception, but no consent mechanism realistically covers US law enforcement access to health data processed in a cloud service, and most of the remaining Article 9(2) bases (employment law, vital interests, public health) do not cover commercial NLP processing by a cloud provider.
DPA guidance is clear. Multiple EU data protection authorities have specifically flagged health data processing in US cloud services as structurally incompatible with GDPR Article 9 absent a valid and robust transfer mechanism. Since Schrems II, adequacy decisions for health data processing in the US context have faced ongoing legal challenge. The mandatory DPIA requirement under Article 35(3)(b) for large-scale processing of special category data applies to ComprehendMedical deployments.
Practical implication: Any team running ComprehendMedical on patient records, clinical notes, or health-related user data requires a documented Article 9(2) legal basis, a current DPIA filed with its lead supervisory authority, and — under most DPA interpretations — cannot rely on Standard Contractual Clauses alone given CLOUD Act residual risk for health data.
PII Detection Creates a Compliance Audit Trail Under US Jurisdiction
Comprehend's PII detection API (DetectPiiEntities, ContainsPiiEntities) is designed to help teams find and redact personal data in documents. The feature is built for compliance use cases: identify credit card numbers, social security numbers, email addresses, names, and addresses before storing or sharing documents.
There is a structural irony in the compliance design. To use Comprehend for PII detection, the full text — including all the PII you are trying to manage — is transmitted to AWS servers. The detection results, including entity types and character offsets, are logged in CloudWatch by default. The logs contain enough information to reconstruct which documents contained which types of PII, with timestamps.
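To make concrete what travels and what comes back, here is a minimal pure-Python sketch that redacts text using the entity list a DetectPiiEntities call returns (dicts carrying Type, Score, BeginOffset, EndOffset). The `redact` helper and the sample data are hypothetical; only the response shape follows the documented API output.

```python
def redact(text, entities, min_score=0.8):
    """Replace each detected PII span with its entity type tag.

    `entities` follows the shape of a DetectPiiEntities response:
    dicts with Type, Score, BeginOffset, EndOffset.
    """
    out = []
    cursor = 0
    for e in sorted(entities, key=lambda e: e["BeginOffset"]):
        if e["Score"] < min_score:
            continue  # skip low-confidence detections
        out.append(text[cursor:e["BeginOffset"]])
        out.append(f"[{e['Type']}]")
        cursor = e["EndOffset"]
    out.append(text[cursor:])
    return "".join(out)

sample = "Contact Jane Doe at jane@example.com"
ents = [
    {"Type": "NAME", "Score": 0.99, "BeginOffset": 8, "EndOffset": 16},
    {"Type": "EMAIL", "Score": 0.99, "BeginOffset": 20, "EndOffset": 36},
]
print(redact(sample, ents))  # → Contact [NAME] at [EMAIL]
```

Note the asymmetry: the redaction happens on your side, but the full unredacted text had to reach AWS first for the entities to be detected at all.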
The CloudWatch log is itself a GDPR-relevant record. Under Article 30, data controllers must maintain records of processing activities. The Comprehend CloudWatch logs become part of your Article 30 record, but they are stored in US-jurisdiction infrastructure, reachable by US authorities under the CLOUD Act, and may be retained beyond your own retention policy unless you explicitly configure CloudWatch log expiration.
The audit trail inversion problem. Using a US cloud service to detect GDPR-regulated PII means that your compliance tool creates its own compliance liability. The detection metadata (document identifier, detected PII categories, detection timestamps) is personal data about your users' data, stored under US jurisdiction. Article 5(1)(b) purpose limitation and Article 5(1)(e) storage limitation apply to this secondary log — but you do not control the underlying CloudWatch retention unless explicitly configured.
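If you do keep Comprehend-related CloudWatch logs, at minimum bound their lifetime. A retention policy can be set with the AWS CLI; the log group name below is a placeholder for whichever group your pipeline actually writes to.

```shell
# Cap retention on the log group receiving Comprehend output.
# The group name here is illustrative -- substitute your own.
aws logs put-retention-policy \
  --log-group-name "/aws/comprehend/example-pii-pipeline" \
  --retention-in-days 30
```

This bounds the storage limitation exposure but does not change the jurisdiction question: the logs remain in AWS infrastructure for those 30 days.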
For large-scale document processing, the volume of PII detection logs compounds the risk. A document management system running hundreds of thousands of PII detection requests generates a CloudWatch log that is itself a substantial record of personal data processing under US jurisdiction. Cross-border transfer rules under Article 46 apply to this log, not just to the documents themselves.
Sentiment Analysis as Behavioral and Psychological Profiling
Comprehend's sentiment analysis API assigns positive, negative, neutral, or mixed sentiment scores to text. For customer feedback analysis, support ticket triage, and social media monitoring, it provides a quick signal about customer emotional state at scale.
GDPR treats this kind of behavioral and psychological evaluation as profiling. Article 4(4) defines profiling broadly: "any form of automated processing of personal data consisting of the use of personal data to evaluate certain personal aspects relating to a natural person," and the guidelines on automated individual decision-making and profiling (WP251, endorsed by the EDPB) interpret it expansively. Sentiment analysis applied to individually identifiable communications (support tickets, chat messages, app reviews with user IDs) creates psychological evaluations that fall squarely within this definition.
The special category inference risk. When sentiment analysis is applied to text that discusses health conditions, political views, religious beliefs, or sexual orientation, the sentiment scores become derived special category data. A support ticket from a user discussing medication side effects, analyzed for sentiment and stored with the user record, links a sentiment score to health-related content. EDPB guidance treats such derived data as potentially triggering Article 9 protection, particularly when the inference creates a behavioral or psychological profile linked to a specific individual.
CloudWatch sentiment logs. Comprehend's async batch sentiment jobs write results to S3 and log job metadata to CloudWatch. The job output files contain sentiment scores linked to document identifiers — if those identifiers map back to user records (as they do in most real implementations), the sentiment output is personal data under Article 4(1). The combination of user ID, timestamp, and sentiment score in a US-jurisdiction S3 bucket creates an Article 46 transfer risk.
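One partial mitigation is to decouple user identity from whatever leaves your environment: submit documents under random identifiers and keep the ID-to-user mapping in your own EU database, so the external output links sentiment to an opaque token rather than a user record. A minimal sketch of the pattern (the function and data shapes are hypothetical, not any Comprehend API):

```python
import secrets

def pseudonymize_for_batch(tickets):
    """Swap user IDs for random document IDs before any external NLP call.

    The mapping table stays in your EU database; only doc IDs and text
    travel. Re-linking results to users happens locally, after the fact.
    """
    mapping = {}
    batch = []
    for t in tickets:
        doc_id = secrets.token_hex(8)  # opaque, unguessable identifier
        mapping[doc_id] = t["user_id"]
        batch.append({"doc_id": doc_id, "text": t["text"]})
    return batch, mapping
```

This reduces, but does not eliminate, the exposure: the text itself may still identify the author, which is why the rest of this section argues for local inference instead.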
Article 22 automated decision-making. If sentiment scores feed into automated decisions — customer tier assignments, support queue prioritization, churn risk scoring — the Article 22 framework applies. This requires either explicit consent, contractual necessity, or authorization by EU or Member State law. It also creates right-to-explanation obligations that are difficult to fulfill when the underlying sentiment analysis is a black-box AWS API call.
Custom Entity Recognizer: Training Data Under CLOUD Act Reach
Comprehend allows organizations to train custom Named Entity Recognition (NER) models using their own annotated data. This is useful for domain-specific entity types: product codes, internal identifiers, specialized terminology, or entity types not covered by Comprehend's built-in recognizers.
Training a custom entity recognizer requires uploading training data to Amazon S3 and triggering a Comprehend training job. The training data — annotated text documents that may contain customer names, internal codes linked to individuals, or domain-specific PII — is processed by AWS infrastructure during training and stored in AWS-managed buckets.
The training data transfer problem. The upload of personal data for model training constitutes a data transfer to a third-country processor (AWS as a US company) and requires a valid Article 46 mechanism. For training data containing special category personal data, Chapter V transfer rules apply with additional scrutiny. The training process itself — during which AWS infrastructure processes your annotated documents — falls under Article 28 processor requirements.
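Before any annotated corpus reaches S3, it is worth running a minimization screen so obvious direct identifiers never leave your environment. The regex screen below is deliberately crude and purely illustrative; a real pipeline would use a proper PII scanner, but the gating pattern is the point.

```python
import re

# Crude pre-upload screen: block training files containing obvious
# direct identifiers. Illustrative only -- substitute a real PII
# scanner before annotated text reaches S3.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "iban": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}

def screen_document(text):
    """Return the identifier categories found; an empty list means clear."""
    return [name for name, rx in PATTERNS.items() if rx.search(text)]

print(screen_document("Kontakt: jane.doe@example.com"))  # → ['email']
```

Documents that fail the screen get routed to a local anonymization step instead of the training bucket.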
Model retention and intellectual property. After training, Comprehend retains the custom model in AWS infrastructure. The model is trained on your data, encodes patterns from your data, and may contain sufficient information to partially reconstruct training examples. This creates an ongoing Article 5(1)(b) purpose limitation question: the training data was collected for your product purpose, but it is now encoded in a model retained by AWS. Data subject erasure requests under Article 17 become structurally difficult to fulfill — you cannot "erase" a person's contribution from a trained neural network without retraining from scratch.
The annotation metadata log. Training job metadata — submission timestamps, S3 object references, model performance metrics — is stored in CloudWatch and AWS service logs under US jurisdiction. This metadata creates a secondary record of what data was processed and when, subject to CLOUD Act access.
Async Document Jobs: Uncontrolled Retention of Processing History
Comprehend's asynchronous APIs — StartEntitiesDetectionJob, StartKeyPhrasesDetectionJob, StartSentimentDetectionJob, StartTopicsDetectionJob — are designed for batch processing of large document sets. They process input from S3, write results to S3, and generate job records that persist in the Comprehend service.
The job records are not simply temporary execution logs. They contain: input S3 paths (pointing to your documents), output S3 paths (pointing to analysis results), job submission timestamps, data access role ARNs, and status history. This job history is accessible through the Comprehend API and is retained in AWS infrastructure indefinitely unless you explicitly delete jobs.
Article 5(1)(e) storage limitation. The principle of storage limitation requires that personal data be kept "no longer than is necessary for the purposes for which the personal data are processed." Comprehend job records contain pointers to documents that may themselves be deleted after your retention period — but the job record persists, creating a reference to data that no longer exists and a log of processing that may extend beyond your intended retention window. The job history is personal data when the processed documents contain personal data.
Cross-account and cross-region job propagation. Comprehend jobs in AWS Organizations environments can be triggered across accounts. The job metadata propagates across account boundaries and may be visible in centralized AWS management accounts, CloudTrail aggregation accounts, or AWS Organizations service control policy logs. Each propagation step creates additional processing under Article 4(2) and additional transfer exposure under Article 46.
Article 30 compliance gap. Your Article 30 records of processing activities must document all processing of personal data. Comprehend job history constitutes an additional processing record not always captured in manual Article 30 inventories. If a supervisory authority requests your processing records during an investigation, Comprehend job history may reveal processing activities not reflected in your official documentation.
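A periodic reconciliation job can surface that gap: pull the job list from the Comprehend API and compare each input location against the S3 prefixes your Article 30 record actually documents. The sketch below operates on records shaped like ListEntitiesDetectionJobs response items (JobId plus InputDataConfig.S3Uri); the helper name and prefix list are hypothetical.

```python
def undocumented_jobs(job_list, documented_prefixes):
    """Flag Comprehend jobs whose input path is not covered by the
    Article 30 inventory of documented S3 locations.

    `job_list` items follow the shape of a ListEntitiesDetectionJobs
    response entry: JobId plus InputDataConfig.S3Uri.
    """
    flagged = []
    for job in job_list:
        uri = job["InputDataConfig"]["S3Uri"]
        if not any(uri.startswith(p) for p in documented_prefixes):
            flagged.append(job["JobId"])
    return flagged

jobs = [
    {"JobId": "j1", "InputDataConfig": {"S3Uri": "s3://docs/tickets/a"}},
    {"JobId": "j2", "InputDataConfig": {"S3Uri": "s3://scratch/x"}},
]
print(undocumented_jobs(jobs, {"s3://docs/"}))  # → ['j2']
```

Any flagged job is processing your official records do not account for, and should be reconciled before a supervisory authority finds it first.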
Topic Modeling: Corpus Intelligence Under AWS Custody
Comprehend's StartTopicsDetectionJob runs Latent Dirichlet Allocation (LDA) or equivalent topic modeling over your entire document corpus. It identifies recurring themes, clusters documents by topic, and produces a topic-term matrix describing the corpus. This is useful for document discovery, content categorization, and trend analysis at scale.
The GDPR issue with topic modeling is subtle but real. LDA operates over the entire corpus simultaneously: the statistical model produced encodes patterns extracted from all documents, including documents that contain personal data. The resulting topic model — stored in AWS S3 as Comprehend job output — is a mathematical representation of patterns in your data. If those patterns are derived from personal communications, support tickets, or health records, the topic model itself may qualify as personal data under the GDPR's broad definition in Article 4(1).
Corpus-level inference and Article 22. Topic models are not just summaries — they are inference engines. Applied back to new documents, they can assign topic probabilities and infer behavioral patterns about the users who generated those documents. If a topic model trained on customer support tickets is used to predict which customers are likely to churn, Article 22 automated decision-making provisions may apply. The model, stored in AWS infrastructure, becomes part of a decision-making pipeline where AWS holds a component that encodes patterns from your users' data.
EDPB enrichment analysis risk. EDPB guidelines on data minimization under Article 5(1)(c) require that processing be limited to what is necessary for the stated purpose. Running corpus-wide topic modeling on personal communications to categorize support tickets goes beyond the minimum necessary for ticket routing. It creates a corpus-level intelligence product — stored under US jurisdiction — that was not the stated purpose of collecting the original communications.
Deletion cascade impossibility. If a data subject exercises their Article 17 right to erasure, you can delete the original documents. But a topic model trained on those documents cannot be "un-trained" from their content. The model encodes statistical patterns from erasure-requested data indefinitely. AWS's model retention practices mean the topic model persists in US infrastructure after the underlying personal data has been erased — creating a structural Article 17 compliance gap.
EU-Native Alternatives to AWS Comprehend
The NLP ecosystem has strong EU-deployable options that cover Comprehend's core use cases without the cross-border transfer exposure.
spaCy — Self-Hosted NLP Pipeline
spaCy is the leading open-source NLP library for production applications. It provides named entity recognition, PII detection, part-of-speech tagging, dependency parsing, and text classification. Statistical models are available for German, French, Spanish, Italian, Portuguese, Dutch, and other EU languages, plus English. spaCy runs fully on-premises or on any cloud infrastructure — no data leaves your environment.
For PII detection specifically, spaCy combined with the presidio-analyzer library (an open-source Microsoft project) provides a production-grade PII detection pipeline deployable as a microservice. The combination replaces Comprehend's PII detection with infrastructure you control, data that never leaves your EU environment, and no CloudWatch retention problem.
Deploy spaCy as a FastAPI microservice on sota.io for a managed EU PaaS option: full API compatibility with your text processing pipeline, no data jurisdiction exposure, automatic restarts, and horizontal scaling. The container runs in EU data centers under your exclusive control.
Hugging Face — EU-Accessible Model Hub with Self-Hosting
Hugging Face provides the largest open-source NLP model repository, including fine-tuned NER, sentiment, classification, and question-answering models for European languages. The Transformers library allows you to run any model locally or on your own infrastructure.
For sentiment analysis, cardiffnlp/twitter-xlm-roberta-base-sentiment and similar multilingual models provide strong coverage across European languages. For German NER, deepset/bert-base-german-cased-ner is production-tested. Both run locally: no API calls, no CloudWatch logs, no CLOUD Act exposure.
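Assuming the transformers library is installed, running such a model locally is a few lines. The first call downloads the weights once; after that, inference needs no network access at all.

```python
from transformers import pipeline

# Weights are fetched once from the Hub, then inference is fully
# local -- no per-request API call, no external logging.
sentiment = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-xlm-roberta-base-sentiment",
)
print(sentiment("Der Support war ausgezeichnet!"))
```

For air-gapped or high-compliance environments, the model directory can be mirrored into an internal artifact store so even the one-time download avoids the public Hub.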
Hugging Face's Inference Endpoints product (hosted inference) is available in EU regions and operates under EU data residency commitments. For teams that cannot run their own GPU infrastructure, this provides a middle path: managed inference with EU data residency guarantees. Note, however, that Hugging Face is itself a US-headquartered company, so teams with a strict CLOUD Act threat model should prefer the self-hosted route.
Flair NLP — German Origin, EU-First
Flair is an NLP framework that originated at Zalando Research in Berlin and is now developed and maintained at Humboldt University of Berlin. It provides state-of-the-art NER for German, French, Spanish, Dutch, and other European languages, plus text classification and sequence labeling. Flair models are pre-trained and available for download; inference runs locally.
For German NER specifically, Flair outperforms most commercial alternatives on CoNLL-2003 benchmarks for German text. It is the recommended choice for German-language NLP pipelines that need to process customer communications, legal documents, or health records where US jurisdiction exposure is not acceptable.
John Snow Labs NLP — Healthcare-Specific EU Alternative
John Snow Labs' Spark NLP for Healthcare provides HIPAA-compliant, self-hosted NLP for healthcare text analysis. It directly replaces ComprehendMedical for clinical NLP use cases. The library runs on-premises or in private cloud, processes health records locally, and provides extractors for medications, diagnoses, clinical entities, and temporal relationships without transmitting data to any external service.
For GDPR Article 9 health data processing, John Snow Labs NLP is the architecturally correct replacement: data stays within your EU infrastructure, no US cloud service processes your clinical text, and the deletion cascade problem disappears because you own the model and can retrain on demand.
Apache OpenNLP — Open Source with EU Deployment
Apache OpenNLP is a mature, open-source NLP toolkit for maximum-control NLP deployments. It provides NER, sentence detection, tokenization, and text categorization. It lacks the accuracy of modern transformer-based alternatives for most tasks, but it requires zero dependencies on any external model registry or commercial service, which makes it appropriate for high-compliance environments where supply chain risk matters as much as accuracy.
The sota.io Deployment Pattern
The fastest path from Comprehend to an EU-compliant NLP deployment: deploy a spaCy or Hugging Face inference container on sota.io. Your application sends text to your own container running in EU infrastructure. The container runs NER, PII detection, sentiment analysis, or document classification locally. Results return to your application without leaving EU jurisdiction. No CloudWatch. No job history. No CLOUD Act exposure. Article 30 records stay accurate. Article 17 erasure requests affect only your data — no third-party model retention.
For teams already using Comprehend's API, the spaCy REST wrapper and Hugging Face's standardized Inference API both offer compatible request/response formats. Migration requires changing API endpoint URLs and authentication — not rewriting NLP integration code.
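To keep downstream code untouched during migration, a thin adapter can reshape a local model's entities into the response format Comprehend's DetectEntities returns (an Entities list of dicts with Text, Type, Score, BeginOffset, EndOffset). The tuple input format below is an assumption about your local pipeline, not any library's API.

```python
def to_comprehend_shape(doc_ents):
    """Adapt (text, label, start, end) tuples from a local NER model
    into the response shape Comprehend's DetectEntities returns, so
    downstream consumers keep working after migration.

    Label names are whatever your local model emits -- map them to the
    types your application already expects.
    """
    return {
        "Entities": [
            {
                "Text": text,
                "Type": label,
                "Score": 1.0,  # local models may not expose calibrated scores
                "BeginOffset": start,
                "EndOffset": end,
            }
            for text, label, start, end in doc_ents
        ]
    }

print(to_comprehend_shape([("Berlin", "LOCATION", 0, 6)]))
```

With the adapter in place, the cutover really is just an endpoint swap: the consuming services never learn that the entities now come from a container you run.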
Summary
AWS Comprehend creates six distinct GDPR exposure points: ComprehendMedical processes Article 9 health data in US-jurisdiction infrastructure with CLOUD Act reach; PII detection creates a compliance audit trail under US jurisdiction that is itself a GDPR processing record; sentiment analysis generates behavioral profiles that may trigger Article 22 and special category protections; custom entity recognizer training exposes annotated documents to CLOUD Act reach and creates an Article 17 erasure gap in trained models; async batch jobs generate persistent processing history records with indefinite retention in AWS; and topic modeling creates corpus-level intelligence products stored in US infrastructure that encode patterns from personal data after source documents may have been erased.
Each of these exposure points is addressable through EU-native deployment. spaCy, Hugging Face, Flair, and John Snow Labs NLP collectively cover every Comprehend use case. Deploy them as microservices on sota.io for managed EU PaaS infrastructure — or self-host them in your own EU environment for maximum control. The choice of deployment model matters less than the architectural principle: NLP processing pipelines that handle personal data belong in infrastructure under your control, in jurisdictions where CLOUD Act compulsion cannot reach your users' text.
EU-Native Hosting
Ready to move to EU-sovereign infrastructure?
sota.io is a German-hosted PaaS — no CLOUD Act exposure, no US jurisdiction, full GDPR compliance by design. Deploy your first app in minutes.