2026-04-30·12 min read

AWS Textract EU Alternative 2026: OCR, Document Processing, and the GDPR Problem

Post #734 in the sota.io EU Compliance Series

AWS Textract is Amazon's managed document processing service. It extracts printed text, handwriting, structured form data, table contents, and identity document fields from scanned PDFs, images, and multi-page documents — without requiring any ML expertise, model training, or infrastructure management. For teams building document-heavy workflows — onboarding pipelines, invoice automation, KYC systems, healthcare record digitisation, or insurance claims processing — Textract is the fastest path from raw scan to structured data.

That speed carries a GDPR cost that is easy to underestimate. Document processing is rarely discussed in the same breath as biometric data or health records, but the documents Textract processes routinely contain the most sensitive personal data imaginable: patient files, identity cards, passports, bank statements, tax returns, and medical prescriptions. When Textract processes these documents, it is processing personal data — and in many cases, special category data under Article 9 — through an API controlled by Amazon Web Services, Inc., a Delaware corporation headquartered in Seattle, Washington. The CLOUD Act (18 U.S.C. § 2713) means that US authorities can compel production of that data regardless of which AWS region processes it.

This analysis covers six GDPR exposure points in AWS Textract that European development teams need to understand before deploying document processing pipelines in production.

What Textract Actually Does

Textract provides four main API operations. DetectDocumentText extracts raw text and word-level geometry. AnalyzeDocument extracts form key-value pairs and table structures from business documents. AnalyzeID is a dedicated API for identity documents — it extracts structured fields from passports, driver licences, and national identity cards including name, date of birth, document number, nationality, and address. AnalyzeExpense processes receipts and invoices.

Each API call sends the document to AWS infrastructure for processing. Textract supports both synchronous processing for single-page documents and asynchronous jobs via StartDocumentTextDetection and StartDocumentAnalysis for multi-page PDFs. The asynchronous path means documents are stored in S3, processed by Textract's backend infrastructure, and results written back to S3 — creating multiple storage points under AWS control for the duration of the job and beyond.
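The asynchronous path described above can be sketched as follows. This is a minimal illustration, not AWS reference code: `client` is assumed to behave like `boto3.client("textract")`, and the function name, polling interval, and comments marking the storage points are illustrative choices.

```python
import time


def run_async_text_detection(client, bucket, key, poll_seconds=2.0, max_polls=60):
    """Start an asynchronous Textract job and collect the detected lines.

    `client` is assumed to expose the same methods as boto3.client("textract").
    """
    # Storage point 1: the source document must already be in S3.
    job = client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    job_id = job["JobId"]

    # Storage point 2: job state and results are held by Textract
    # until the caller fetches them.
    for _ in range(max_polls):
        page = client.get_document_text_detection(JobId=job_id)
        if page["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Textract job {job_id} did not finish in time")

    if page["JobStatus"] != "SUCCEEDED":
        raise RuntimeError(f"Textract job {job_id} ended as {page['JobStatus']}")

    # Results are paginated; follow NextToken until it is absent.
    lines = []
    while True:
        lines += [b["Text"] for b in page.get("Blocks", []) if b["BlockType"] == "LINE"]
        token = page.get("NextToken")
        if not token:
            break
        page = client.get_document_text_detection(JobId=job_id, NextToken=token)

    # Storage point 3: wherever the caller writes `lines` next.
    return job_id, lines
```

Passing the client in as a parameter keeps the sketch testable without AWS credentials and makes each point where personal data comes to rest visible in the code.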

GDPR Exposure Point 1: Medical Document Processing Under Article 9

Healthcare is one of Textract's primary target use cases. AWS documentation explicitly covers hospital workflows: digitising patient admission forms, extracting structured data from lab results, processing medical imaging reports, and automating prescription processing. European hospitals, clinics, insurance companies, and health-tech startups routinely evaluate Textract for exactly these use cases.

Article 9(1) of the GDPR prohibits processing data "concerning health" unless one of the conditions in Article 9(2) applies. Health data is defined broadly in Article 4(15): it covers "personal data related to the physical or mental health of a natural person, including the provision of health care services, which reveal information about his or her health status." A scanned patient record, a prescription, a lab result, a medical history form: all of these are health data. Extracting text from them using Textract means transmitting that health data to AWS infrastructure under US jurisdiction.

The practical problem is not just data residency. Even if you select the eu-west-1 or eu-central-1 region, the service endpoint is operated by a US-controlled entity. The CLOUD Act does not require data to be stored in the United States — it requires only that the provider is a US company with possession, custody, or control of the data. AWS, as a US company, satisfies that test for any data it processes, regardless of the AWS region.

For health data under Article 9, this creates a structural compliance problem. Your legal basis under Article 9(2)(h) (health care purposes) authorises your processing — it does not authorise AWS to receive a compelled disclosure order for that data on behalf of US authorities. The data protection authority in your jurisdiction may well take the view that transmitting Article 9 health data to infrastructure controlled by a CLOUD Act-subject company requires a separate legal basis or specific contractual safeguards that go beyond standard AWS Data Processing Addendums.
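The jurisdictional test above turns on who controls the provider, not on where the data sits. A deliberately simplified sketch of that check for a sub-processor inventory might look like this; the `Processor` fields and the example entries are hypothetical, and a real assessment would involve more than a single flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Processor:
    name: str
    controlling_jurisdiction: str  # where the controlling legal entity is incorporated
    processing_region: str         # where the data is physically processed


def cloud_act_exposed(processor: Processor) -> bool:
    # The test looks at corporate control; the processing region plays no role.
    return processor.controlling_jurisdiction == "US"


def exposed_processors(processors):
    """Return the names of processors that meet the simplified control test."""
    return [p.name for p in processors if cloud_act_exposed(p)]


# Hypothetical inventory: the same EU region, different controlling entities.
inventory = [
    Processor("textract-eu-central-1", "US", "eu-central-1"),
    Processor("self-hosted-ocr", "DE", "eu-central-1"),
]
print(exposed_processors(inventory))
```

Both entries process data in the same region, yet only the first is flagged, which is the point the region-selection argument tends to miss.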

GDPR Exposure Point 2: The AnalyzeID API and Identity Document Extraction

Textract's AnalyzeID API is purpose-built for extracting data from identity documents. The API returns structured fields including FIRST_NAME, LAST_NAME, DATE_OF_BIRTH, DOCUMENT_NUMBER, EXPIRATION_DATE, ID_TYPE, ISSUING_COUNTRY, and ADDRESS. It supports passports, driver licences, and identity cards from multiple countries.

KYC (Know Your Customer) and AML (Anti-Money Laundering) workflows in fintech, banking, and regulated industries routinely require extracting this data. The compliance problem is layered. First, national identity document numbers, passport numbers, and date-of-birth-plus-document combinations are personal data in the ordinary sense: they directly identify a natural person. Second, the ISSUING_COUNTRY field, combined with a name and date of birth, can make nationality inferable, and nationality can become special category data under Article 9(1) in contexts where it reveals racial or ethnic origin.

Third, and most practically relevant: identity document numbers fall under national law implementations of Article 87 GDPR, which allows member states to set specific conditions for processing national identification numbers. Several EU member states treat national ID numbers as quasi-special-category data with stricter processing requirements. Routing identity document images through Textract's AnalyzeID API means those documents — and the structured data extracted from them — are processed by a US-controlled service provider.

For organisations under financial regulation, the combination of CLOUD Act exposure on identity documents and Article 87 national ID restrictions can create an unresolvable conflict between regulatory compliance requirements in different jurisdictions.

GDPR Exposure Point 3: Augmented AI (A2I) and Human Review

When Textract's confidence score for a field falls below a configured threshold, Amazon Augmented AI (A2I) can route the document to human reviewers for verification. A2I supports three reviewer workforce options: Amazon Mechanical Turk (a global crowdsourced platform), third-party vendor workforces, and private teams configured by the operator.

The Mechanical Turk path is the problematic one. When you enable A2I with the public workforce, documents that Textract cannot confidently parse are sent to Mechanical Turk workers — who may be located anywhere in the world, are not employees of your organisation, and are not subject to your data processing agreements or GDPR obligations. For documents containing personal data, this creates an uncontrolled sub-processor chain that is structurally incompatible with Article 28's requirement for documented processor agreements and Article 5(1)(f)'s integrity and confidentiality principle.

Even the private workforce option creates an Article 30 records obligation: you must document human review as a processing activity, identify the legal basis, and ensure that reviewers are subject to confidentiality obligations. For medical documents or identity documents, the combination of automated Textract extraction and human review means that sensitive documents are viewed by a larger number of individuals than operators typically anticipate when selecting "automated document processing."

The data minimisation principle under Article 5(1)(c) cuts directly against A2I workflows for sensitive documents: if a document contains health data or identity document fields, routing the full document image to a human reviewer exposes far more data than is necessary to correct the specific low-confidence field.
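One way to act on that principle is to send reviewers only the region around the low-confidence field instead of the whole page. The sketch below assumes input shaped like Textract `Block` objects (`Id`, `Confidence` on a 0-100 scale, and `Geometry.BoundingBox` expressed as ratios of page size); the function name, threshold, and padding are illustrative.

```python
def review_crops(blocks, image_width, image_height, threshold=90.0, padding=0.02):
    """Return pixel crop boxes for low-confidence blocks only.

    Each box is (left, top, right, bottom) in pixels, padded slightly so the
    reviewer sees a little context but not the rest of the page.
    """
    crops = []
    for block in blocks:
        if block["Confidence"] >= threshold:
            continue  # confident fields never reach a human reviewer
        box = block["Geometry"]["BoundingBox"]
        left = max(0.0, box["Left"] - padding)
        top = max(0.0, box["Top"] - padding)
        right = min(1.0, box["Left"] + box["Width"] + padding)
        bottom = min(1.0, box["Top"] + box["Height"] + padding)
        crops.append({
            "block_id": block["Id"],
            "pixels": (
                round(left * image_width),
                round(top * image_height),
                round(right * image_width),
                round(bottom * image_height),
            ),
        })
    return crops
```

The returned boxes can be handed to any image library to cut out the review snippets, so the full document image never has to leave the extraction service.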

GDPR Exposure Point 4: The Art. 17 Erasure Gap in Extracted Data

AWS Textract's processing model creates a data lineage challenge for erasure requests under Article 17. When Textract processes a document, the following artefacts are created: the original document in S3, the Textract job result (a JSON payload with all extracted text, bounding boxes, confidence scores, and structured data), and in asynchronous workflows, intermediate processing state within Textract's internal infrastructure.

Deleting the S3 object satisfies the erasure obligation for the original document. It does not necessarily satisfy the obligation for the extracted data. The Textract output JSON contains a full structured representation of the document's contents — names, addresses, dates, document numbers, table rows — that must be independently deleted from any downstream storage it was written to. If Textract results are written to DynamoDB, RDS, Elasticsearch, or a data warehouse, each destination creates an independent erasure obligation.

The structural problem is that Textract makes it easy to create derived data that is semantically dense — a form extraction that converts an unstructured scan into a structured record with named fields. That derived data is personal data in its own right. Teams building Textract pipelines frequently write extracted data to multiple destinations for different downstream consumers: a CRM integration receives the contact fields, an analytics pipeline receives the document metadata, a compliance archive receives the full JSON. Each destination requires its own erasure procedure, and without a complete data flow map, erasure obligations are difficult to satisfy systematically.
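A data flow map of the kind described above can be as simple as a registry that records every destination a document's extracted data was written to, together with a way to delete it there. The sketch below is a minimal in-memory illustration; the class name, method names, and destination labels are hypothetical, and a production version would persist the map.

```python
class DataFlowMap:
    """Tracks where derived data for each document was written."""

    def __init__(self):
        self._destinations = {}  # document_id -> {destination name: deleter}

    def record(self, document_id, destination, deleter):
        """Register a destination and the callable that erases the copy there."""
        self._destinations.setdefault(document_id, {})[destination] = deleter

    def outstanding(self, document_id):
        """List destinations that still hold a copy."""
        return list(self._destinations.get(document_id, {}))

    def erase(self, document_id):
        """Run every registered deleter; return the destinations that failed."""
        copies = self._destinations.get(document_id, {})
        failed = []
        for destination, deleter in list(copies.items()):
            try:
                deleter(document_id)
            except Exception:
                failed.append(destination)  # stays registered for a retry
            else:
                del copies[destination]
        return failed
```

Calling `record` at every write and `erase` on an Article 17 request gives a single place to see which copies remain, instead of reconstructing the pipeline from memory.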

GDPR Exposure Point 5: Cross-Service Pipeline and Purpose Limitation

Textract is rarely deployed in isolation. The typical production pipeline combines Textract with S3 for document storage, Lambda for processing orchestration, and one or more downstream services: Amazon Comprehend for entity recognition, Amazon Kendra for document search, Amazon OpenSearch for full-text indexing, or DynamoDB for structured storage. AWS provides pre-built reference architectures for exactly this pattern.

Article 5(1)(b) requires that personal data be "collected for specified, explicit and legitimate purposes and not further processed in a manner that is incompatible with those purposes." When a Textract pipeline routes extracted text from a patient intake form to Comprehend for entity recognition, to Kendra for a search index, and to DynamoDB for a patient database, each routing decision is a separate processing purpose that must have its own legal basis. The original consent for "digitising your medical form" does not extend to building a full-text search index of that form's contents.

This is not a hypothetical risk. Data protection authorities have found purpose limitation violations in exactly this pattern — where organisations deployed integrated cloud AI pipelines without separately documenting the legal basis for each processing activity within the pipeline. GDPR Article 30 requires records of processing activities that are sufficiently granular to identify each purpose; a generic entry for "document processing" does not cover the multiple downstream uses that modern AI document pipelines enable.
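The granularity requirement can be checked mechanically: every step in the pipeline should map to a record naming its purpose and legal basis. The following sketch assumes a simple dictionary as the record of processing activities; the step names and register entries are hypothetical.

```python
def undocumented_steps(pipeline_steps, register):
    """Return pipeline steps that lack a documented purpose or legal basis."""
    missing = []
    for step in pipeline_steps:
        entry = register.get(step)
        if not entry or not entry.get("purpose") or not entry.get("legal_basis"):
            missing.append(step)
    return missing


# Hypothetical pipeline and register for a patient intake form.
pipeline = ["ocr_extraction", "entity_recognition", "search_index", "patient_database"]
register = {
    "ocr_extraction": {"purpose": "digitise intake form", "legal_basis": "Art. 9(2)(h)"},
    "patient_database": {"purpose": "maintain patient record", "legal_basis": "Art. 9(2)(h)"},
    "search_index": {"purpose": "full-text search", "legal_basis": ""},
}
print(undocumented_steps(pipeline, register))
```

Running a check like this in CI whenever a new downstream consumer is added would surface undocumented purposes before they reach production.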

GDPR Exposure Point 6: Asynchronous Job Retention and Art. 5(1)(e) Storage Limitation

Textract's asynchronous APIs create jobs that persist beyond the processing completion event. Job metadata, status, and results are retained in Textract's internal storage for a period after completion; the exact duration is not publicly documented in Textract's service terms. For synchronous calls, the document is processed and the response returned in a single API call, but the document transmission still involves AWS infrastructure handling the data during processing.

Article 5(1)(e) requires that personal data be "kept in a form which permits identification of data subjects for no longer than is necessary for the purposes for which the personal data are processed." The combination of S3 object lifecycle policies (which you control) and Textract job retention (which AWS controls) creates a dual retention problem: you can delete the S3 document when it is no longer needed, but you cannot control when Textract's own job records expire.

For documents containing Article 9 health data, this is a concrete compliance gap. Your data retention policy may specify deletion after 30 days; Textract's internal retention period may extend beyond that. Article 28(3)(g) requires that processors "delete or return all the personal data to the controller after the end of the provision of services." Standard AWS Data Processing Addendums include general deletion commitments, but the specifics of Textract's internal job retention are not surfaced at the level of granularity that a systematic compliance review requires.
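The half of the retention problem that you do control, the S3 side, can at least be made explicit in code. The sketch below builds a lifecycle configuration in the shape accepted by boto3's `put_bucket_lifecycle_configuration`; the helper name, prefix, and rule ID format are illustrative, and it does nothing about retention inside Textract itself.

```python
def lifecycle_configuration(prefix, retention_days):
    """Build an S3 lifecycle configuration that expires objects under `prefix`."""
    if retention_days < 1:
        raise ValueError("retention_days must be at least 1")
    label = prefix.strip("/").replace("/", "-") or "all"
    return {
        "Rules": [
            {
                "ID": f"expire-{label}-after-{retention_days}d",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                # Delete current objects once the retention period has passed.
                "Expiration": {"Days": retention_days},
                # Also remove superseded versions if bucket versioning is enabled.
                "NoncurrentVersionExpiration": {"NoncurrentDays": retention_days},
            }
        ]
    }


config = lifecycle_configuration("textract-output/", 30)
# Applying it would look like this (bucket name is a placeholder):
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-documents-bucket", LifecycleConfiguration=config
# )
```

Deriving the rule from the same number that appears in the written retention policy keeps the two from drifting apart.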

EU-Native Alternatives for Document Processing and OCR

The EU-hosted OCR and document processing landscape is substantially more mature than it was three years ago. Several high-accuracy alternatives are self-hostable on European infrastructure, most of them under permissive MIT or Apache 2.0 licences that allow commercial use without usage-based pricing or cloud vendor lock-in.

Tesseract OCR

One of the longest-established open-source OCR engines, distributed under the Apache 2.0 licence. Tesseract 5.x uses LSTM-based neural networks and supports over 100 languages including all EU official languages. It originated at Hewlett-Packard in the 1980s, has been open source since 2005, and remains the de facto standard for self-hosted OCR. It handles printed text reliably; handwriting recognition is limited. Deploy via Docker on any EU VPS or Kubernetes cluster. No cloud dependency, no usage fees, no data transmission outside your infrastructure.
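As a rough illustration of how little is needed to run Tesseract locally, the sketch below shells out to the `tesseract` command-line tool and reads the recognised text from standard output. It assumes the binary and the requested language packs are installed in the container; the function names are illustrative.

```python
import subprocess


def tesseract_command(image_path, languages=("eng",), page_seg_mode=3):
    """Build the CLI invocation; "stdout" makes Tesseract print the text."""
    return [
        "tesseract", str(image_path), "stdout",
        "-l", "+".join(languages),      # e.g. "deu+eng" for mixed documents
        "--psm", str(page_seg_mode),    # 3 = fully automatic page segmentation
    ]


def ocr_image(image_path, languages=("eng",), page_seg_mode=3, timeout=120):
    """Run Tesseract on a local file and return the recognised text."""
    result = subprocess.run(
        tesseract_command(image_path, languages, page_seg_mode),
        capture_output=True, text=True, check=True, timeout=timeout,
    )
    return result.stdout
```

A call such as `ocr_image("scan.png", languages=("deu", "eng"))` processes the file entirely on the local machine, with no network request involved.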

EasyOCR

A Python library that wraps neural network models for OCR with minimal configuration. Apache 2.0 licence, supports 80+ languages including most major European languages, runs on CPU or GPU. EasyOCR often achieves higher accuracy than Tesseract 5 on degraded documents and complex layouts, though results depend on the document type. The models are downloaded once and run locally, so no cloud calls are required. Particularly strong for multi-language documents and documents with mixed scripts.

Surya

A state-of-the-art OCR toolkit by Vik Paruchuri. Unlike the other tools listed here, Surya is not permissively licensed: the code is released under GPL-3.0 and the model weights carry a separate licence with conditions on commercial use, so check the current terms before production deployment. Surya uses modern transformer-based models and is reported to outperform Tesseract and EasyOCR on many benchmarks, particularly for layout analysis, table recognition, and reading order detection in complex documents. It includes a dedicated layout analysis model that identifies text blocks, tables, figures, and headers before OCR, similar to Textract's AnalyzeDocument capability. Self-hosted, GPU-accelerated, fully EU-deployable.

docling

IBM Research's open-source document processing library, released under MIT licence. Docling is designed specifically as a structured data extraction tool: it converts PDFs and scanned documents into structured formats (JSON, Markdown, HTML) with full table extraction, reading order preservation, and formula recognition. The output format is comparable to Textract's structured JSON output from AnalyzeDocument. Particularly well-suited for technical documents, scientific papers, and financial reports where table structure matters. Self-hosted, no external API calls.

docTR

Mindee's Document Text Recognition library, Apache 2.0 licence. docTR provides an end-to-end neural network pipeline for OCR: text detection (finding text regions) and text recognition (reading the text) are separate models that can be combined or used independently. It integrates directly with TensorFlow and PyTorch, supports CPU and GPU deployment, and includes pre-trained models for printed text. The modular architecture allows fine-tuning for domain-specific documents (forms, invoices, medical records) on your own training data.

Apache Tika

A content detection and extraction toolkit from the Apache Software Foundation, Apache 2.0 licence. Tika extracts text and metadata from over 1,000 file formats — PDFs, Word documents, spreadsheets, presentations, images, audio, and video. It is deployed as a server (Tika Server via Docker) and accessed via HTTP API. For organisations already handling heterogeneous document formats, Tika provides a single extraction endpoint that handles format detection automatically. Text extraction from PDFs with embedded text requires no OCR engine; image-based PDFs can be routed to Tesseract via Tika's Tesseract integration.
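Calling a self-hosted Tika Server needs nothing beyond the standard library. The sketch below sends a document to the server's `/tika` endpoint with `PUT` and asks for plain text back; the base URL is an assumption (9998 is Tika Server's default port) and the function names are illustrative.

```python
import urllib.request


def tika_text_request(document_bytes, base_url="http://localhost:9998"):
    """Build a PUT request that asks Tika Server for plain-text extraction."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/tika",
        data=document_bytes,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(document_bytes, base_url="http://localhost:9998", timeout=60):
    """Send the document to Tika Server and return the extracted text."""
    request = tika_text_request(document_bytes, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Because the server address is a parameter, the same code works against a container on localhost during development and an internal service name in an EU-hosted cluster.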

Privacy-Preserving Identity Document Extraction

For KYC/AML workflows that currently use Textract's AnalyzeID, the privacy-respecting architecture is to perform extraction locally using a combination of docTR or EasyOCR for text detection and a domain-specific field parser — a rules-based or fine-tuned model that maps extracted text to identity document fields (name, date of birth, document number, expiry). This approach keeps identity document images and extracted data entirely within your infrastructure, avoids Textract's AnalyzeID CLOUD Act exposure, and allows you to build the data flow map that Article 30 records of processing require.
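For passports, the rules-based field parser mentioned above can lean on the machine-readable zone (MRZ), whose layout and check digits are standardised in ICAO Doc 9303. The sketch below parses the second line of a TD3 (passport) MRZ after OCR and validates the check digits; it is a minimal illustration that skips the optional-data and composite checks, and the function names are illustrative.

```python
def mrz_check_digit(field: str) -> int:
    """ICAO 9303 check digit: weights 7, 3, 1 repeating, sum modulo 10."""
    total = 0
    for index, char in enumerate(field):
        if "0" <= char <= "9":
            value = int(char)
        elif "A" <= char <= "Z":
            value = ord(char) - ord("A") + 10
        elif char == "<":
            value = 0  # filler character
        else:
            raise ValueError(f"invalid MRZ character: {char!r}")
        total += value * (7, 3, 1)[index % 3]
    return total % 10


def parse_td3_line2(line: str) -> dict:
    """Extract and verify fields from line 2 of a passport MRZ (44 characters)."""
    if len(line) != 44:
        raise ValueError("TD3 line 2 must be exactly 44 characters")
    checked = {
        "document_number": (line[0:9], line[9]),
        "date_of_birth": (line[13:19], line[19]),   # YYMMDD
        "expiry_date": (line[21:27], line[27]),     # YYMMDD
    }
    result = {"nationality": line[10:13].replace("<", ""), "sex": line[20]}
    for name, (value, check) in checked.items():
        if not ("0" <= check <= "9") or mrz_check_digit(value) != int(check):
            raise ValueError(f"check digit mismatch in {name}")
        result[name] = value.replace("<", "")
    return result


# Specimen line from ICAO Doc 9303 (fictional holder, fictional state "UTO").
specimen = "L898902C36UTO7408122F1204159ZE184226B<<<<<10"
fields = parse_td3_line2(specimen)
```

The check digits double as an OCR quality gate: a misread character usually fails validation, so the document can be re-scanned instead of storing a wrong document number.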

Deploying on EU Infrastructure

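Running any of these tools on infrastructure operated by an EU-controlled provider removes the CLOUD Act jurisdiction problem. What matters is who owns and controls the provider, not only where the data centre is located: an EU region of a US-controlled cloud does not change the analysis. A self-hosted deployment on German, French, or Finnish cloud infrastructure, with processors operating solely under EU law, eliminates the structural conflict between GDPR Article 9 obligations and US government compelled disclosure authority.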

EU-native managed deployment platforms (such as sota.io) allow you to deploy containerised OCR services — Tesseract, EasyOCR, Surya, docling, docTR, Apache Tika — without managing underlying server infrastructure, while keeping all data processing within EU jurisdiction. The OCR container receives documents via internal API, processes them entirely within the EU deployment boundary, and returns structured results without any transmission to US-controlled infrastructure.

This architecture is particularly relevant for healthcare, fintech, and legal sectors where document processing involves Article 9 data or national ID numbers, and where data processing agreements must demonstrate that sub-processors are not subject to third-country disclosure obligations.

Summary

AWS Textract's GDPR exposure comes from six structural sources: Article 9 health data in medical document workflows, identity document extraction through AnalyzeID creating Article 87 and potential Article 9 issues, Augmented AI human review creating uncontrolled sub-processor chains, Article 17 erasure gaps between original documents and derived extracted data, cross-service pipelines that expand processing purposes without separate legal bases, and asynchronous job retention outside your data lifecycle control.

The EU alternatives (Tesseract, EasyOCR, Surya, docling, docTR, and Apache Tika) cover the full range of document processing use cases with self-hosted deployment and no cloud vendor dependency, and most of them carry permissive Apache 2.0 or MIT licences. For organisations processing medical records, identity documents, or any document category containing Article 9 data, the shift to EU-hosted OCR infrastructure eliminates a structural compliance risk that no data processing addendum can fully resolve.

EU-Native Hosting

Ready to move to EU-sovereign infrastructure?

sota.io is a German-hosted PaaS — no CLOUD Act exposure, no US jurisdiction, full GDPR compliance by design. Deploy your first app in minutes.

