2026-06-05·5 min read·sota.io Team

GPAI Copyright Compliance & Training Data Audit: EU AI Act Art.53 Developer Guide

Post #3 in the sota.io EU AI Act GPAI Code of Practice Series

If you provide a general-purpose AI (GPAI) model — whether frontier, fine-tuned, or open-weight — and that model reaches EU users, Article 53 of the EU AI Act activates a copyright compliance obligation that most engineering teams are not prepared for. With 58 days until August 2, 2026, the window for building compliant training data documentation is closing fast.

This post is the third in our GPAI Code of Practice series. Post #1 covered the overall GPAI framework. Post #2 covered model cards and technical documentation. Here we go deep on the copyright dimension: what you must document, how text and data mining (TDM) opt-out compliance works in practice, what an "unresolved claims register" looks like, and how to structure the evidence trail a national competent authority (NCA) will actually review.


The Copyright Obligation Under Article 53

Article 53(1)(c) of the EU AI Act requires GPAI model providers to:

"put in place a policy to comply with Union law on copyright and related rights, and in particular to identify and comply with, including through state-of-the-art technologies, a reservation of rights expressed pursuant to Article 4(3) of Directive (EU) 2019/790"

This is not a vague "best efforts" clause. It has three concrete components:

  1. A documented policy — not just a practice, but a written policy that can be audited
  2. Identification of rights reservations — you must actively check whether rights holders have opted out of text and data mining (TDM)
  3. State-of-the-art technologies — you must use technical means, not just manual spot-checks, to identify and honor opt-outs

The law referenced in Art.53(1)(c) is Directive (EU) 2019/790, the Directive on Copyright in the Digital Single Market (DSM Directive). Its Art.4(3) allows rights holders to opt out of the TDM exception that would otherwise permit their content to be used for machine learning training. For content made publicly available online, the reservation is expected to be expressed in an appropriate manner, such as machine-readable means. A robots.txt entry, a rights reservation in metadata, or an explicit license restriction can constitute a valid opt-out.

If you scraped the open web to train your model and did not check for Art.4(3) opt-outs, you are not compliant with Art.53(1)(c).


Training Data Audit: What You Must Document

Before you can implement a copyright compliance policy, you need a complete inventory of your training data. Under the GPAI Code of Practice (CoP), this inventory feeds directly into the technical documentation required by Art.53(1)(a) and into the public summary required by Art.53(1)(d).

Data Source Classification

For each corpus or dataset used in pre-training or fine-tuning, document:

| Field | Example |
| --- | --- |
| Source name | Common Crawl, Wikipedia, PubMed Central, GitHub, licensed corpus |
| Collection method | Web scrape, direct license, API pull, synthetic generation |
| Collection date range | 2022-01 to 2024-06 |
| Volume | 1.2 TB raw, ~800 GB after deduplication |
| License type | Public domain, CC-BY-4.0, proprietary license, TDM-exception |
| Opt-out scan performed | Yes/No + methodology |
| Unresolved claims | Count + tracking ID range |

This table is not optional. It is the evidence trail. An NCA reviewing your Art.53 compliance will start here.
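
One way to keep this inventory honest is to validate each row before it enters your documentation. The sketch below assumes one plain dictionary per corpus; the key names are hypothetical and simply mirror the table fields, since neither the AI Act nor the CoP prescribes a storage format.

```python
# Hypothetical key names mirroring the inventory table fields (assumption).
REQUIRED_FIELDS = (
    "source_name",
    "collection_method",
    "collection_date_range",
    "volume",
    "license_type",
    "opt_out_scan_performed",
    "unresolved_claims",
)


def missing_inventory_fields(row: dict) -> list[str]:
    """Return the required fields that are absent or blank in one inventory row."""
    missing = []
    for name in REQUIRED_FIELDS:
        value = row.get(name)
        # None and blank strings count as missing; 0 and False are valid answers.
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(name)
    return missing


example_row = {
    "source_name": "Common Crawl",
    "collection_method": "Web scrape",
    "collection_date_range": "2022-01 to 2024-06",
    "volume": "1.2 TB raw, ~800 GB after deduplication",
    "license_type": "TDM-exception",
    "opt_out_scan_performed": "",  # left blank on purpose
    "unresolved_claims": 0,
}
print(missing_inventory_fields(example_row))
```

Running a check like this in CI would flag incomplete rows before they reach the documentation an auditor sees.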

License Classification Matrix

Training data falls into four license tiers under EU copyright law:

  1. Tier 1: Fully cleared
  2. Tier 2: TDM exception applies (DSM Art.4)
  3. Tier 3: Opt-out present, EXCLUDED
  4. Tier 4: Unresolved claims
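
As a rough illustration, the four tiers can be assigned by a small decision function. The input flags and the precedence order below are assumptions made for this sketch, not rules taken from the AI Act or the CoP; your legal team may order them differently.

```python
from enum import IntEnum


class LicenseTier(IntEnum):
    FULLY_CLEARED = 1
    TDM_EXCEPTION = 2
    OPT_OUT_EXCLUDED = 3
    UNRESOLVED = 4


def classify_tier(cleared: bool, opt_out_detected: bool, has_open_claim: bool) -> LicenseTier:
    """Assign a tier from three hypothetical per-source flags.

    Assumed precedence: an open claim always wins, a direct license or
    public-domain status beats a generic opt-out signal, and the TDM
    exception is the fallback for lawfully accessed content.
    """
    if has_open_claim:
        return LicenseTier.UNRESOLVED
    if cleared:
        return LicenseTier.FULLY_CLEARED
    if opt_out_detected:
        return LicenseTier.OPT_OUT_EXCLUDED
    return LicenseTier.TDM_EXCEPTION


print(classify_tier(cleared=False, opt_out_detected=True, has_open_claim=False).name)
```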


TDM Opt-Out Compliance: Reading robots.txt and Rights Signals

The DSM Directive Art.4(3) opt-out mechanism was abstract until the industry began standardising on three technical signals:

1. robots.txt TDM Directives

The following robots.txt directives are widely recognised as TDM opt-outs:

User-agent: *
Disallow: /

# Or more specifically:
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
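
To see how such directives evaluate per crawler, here is a minimal offline sketch using Python's standard urllib.robotparser. The sample robots.txt content and the agent list are illustrative assumptions; a real pipeline would fetch the live file and log the result.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: GPTBot is fully blocked, everyone else only from /private/.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def blocked_agents(robots_txt: str, agents: list[str], url: str) -> list[str]:
    """Return the agents that robots.txt forbids from fetching the given URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Agents without their own entry fall back to the "*" rules.
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


print(blocked_agents(ROBOTS_TXT, ["GPTBot", "CCBot"], "https://example.com/articles/1"))
print(blocked_agents(ROBOTS_TXT, ["GPTBot", "CCBot"], "https://example.com/private/x"))
```

The first call reports only GPTBot as blocked; the second reports both agents, because CCBot inherits the wildcard rule for /private/.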

Under the GPAI CoP, your scraper must:

2. IPTC Rights Reservation Indicator (RTI) Signals

The IPTC Rights Reservation Indicator standard, embedded in image and article metadata, uses two key values:

For web-scraped text content, the equivalent is HTTP response headers:

X-Robots-Tag: noai
X-Robots-Tag: noimageai

Your training data pipeline must capture and log these signals. A scraper that ignores X-Robots-Tag: noai headers is not "using state-of-the-art technologies" to respect opt-outs; it is failing the Art.53(1)(c) standard.
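
A header check of this kind can be sketched in a few lines. The function name is hypothetical, and the parsing assumes the common X-Robots-Tag layout of comma-separated directives with an optional "botname:" prefix.

```python
# Directive names taken from the headers shown above.
AI_OPT_OUT_DIRECTIVES = {"noai", "noimageai"}


def ai_opt_out_directives(headers: list[tuple[str, str]]) -> set[str]:
    """Collect AI opt-out directives from a list of (name, value) response headers."""
    found = set()
    for name, value in headers:
        if name.lower() != "x-robots-tag":
            continue
        for part in value.split(","):
            directive = part.strip().lower()
            # Drop an optional "botname:" prefix such as "googlebot: noai".
            if ":" in directive:
                directive = directive.split(":", 1)[1].strip()
            if directive in AI_OPT_OUT_DIRECTIVES:
                found.add(directive)
    return found


sample_headers = [
    ("Content-Type", "text/html"),
    ("X-Robots-Tag", "noindex, noai"),
    ("x-robots-tag", "noimageai"),
]
print(sorted(ai_opt_out_directives(sample_headers)))
```

Storing the returned set alongside the URL and fetch timestamp gives you the per-document log entry the checklist later asks for.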

3. The tdmrep Standard

The TDM Reservation Protocol (tdmrep) is a W3C Community Group specification that allows rights holders to publish machine-readable TDM permissions at a well-known URI:

https://example.com/.well-known/tdmrep.json

The GPAI CoP explicitly references tdmrep as a state-of-the-art mechanism for identifying opt-outs. If you are building a training data pipeline post-August 2026, checking /.well-known/tdmrep.json is not optional — it is part of the baseline compliance expectation.
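
Once the file is fetched, it still has to be interpreted per path. The sketch below assumes the file is a JSON array of rule objects with "location" and "tdm-reservation" fields, that the first matching rule applies, and that "*" and a trailing "$" are the only wildcards. These are simplifying assumptions that should be checked against the current TDMRep specification.

```python
import json
import re
from typing import Optional


def _location_matches(pattern: str, path: str) -> bool:
    """Prefix match with '*' as a wildcard and a trailing '$' as an end anchor."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None


def tdm_reservation_for(rules: list, path: str) -> Optional[int]:
    """Return the tdm-reservation value of the first rule matching path, else None."""
    for rule in rules:
        location = rule.get("location") if isinstance(rule, dict) else None
        if isinstance(location, str) and _location_matches(location, path):
            return rule.get("tdm-reservation")
    return None


# Illustrative file content, not taken from a real site.
rules = json.loads("""
[
  {"location": "/open/*", "tdm-reservation": 0},
  {"location": "/", "tdm-reservation": 1, "tdm-policy": "https://example.com/policy.json"}
]
""")
print(tdm_reservation_for(rules, "/open/data.html"))
print(tdm_reservation_for(rules, "/articles/1"))
```

Here a value of 1 is read as "TDM rights reserved" and 0 as "not reserved"; a None result means the file says nothing about that path.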


Building the Unresolved Claims Register

Not all copyright status questions can be resolved before training. Some rights holders issue claims after your training cut-off. Some licenses are genuinely ambiguous. The GPAI Code of Practice acknowledges this reality and requires that you maintain an Unresolved Claims Register — a documented record of known copyright questions and your response process.

Minimum Register Schema

from dataclasses import dataclass
from datetime import date
from typing import Optional, Literal

@dataclass
class UnresolvedCopyrightClaim:
    claim_id: str                          # UCR-2024-001
    source_url: str                        # URL or corpus reference
    claim_type: Literal[
        "opt_out_detected_post_training",
        "rights_holder_assertion",
        "license_ambiguity",
        "tdmrep_conflict"
    ]
    claim_detected_date: date
    claim_status: Literal[
        "open",
        "resolved_excluded",
        "resolved_license_obtained",
        "resolved_public_domain_confirmed"
    ]
    resolution_date: Optional[date] = None
    resolution_notes: str = ""
    data_volume_affected: str = ""         # "~2.3GB of 1.2TB training set"
    mitigation_action: str = ""            # "Excluded from next training run"

The Register Lifecycle

  1. Detection — your pipeline flags a URL, domain, or corpus segment as having an unresolved status
  2. Registration — a UCR-YYYY-NNN identifier is created
  3. Legal review — your legal team (or rights-clearance process) assesses the claim
  4. Resolution or escalation — the claim is either resolved or escalated to NCA disclosure
  5. Retention — all records retained for at minimum 10 years (aligned with Art.53 documentation requirements)
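
The lifecycle above can be enforced with a small transition table so that a resolved claim is never silently reopened or overwritten. The statuses are the ones from the register schema; the transition rules themselves are an assumption of this sketch, and an "escalated" status would be an extension you add yourself.

```python
# Allowed status changes, using the claim_status values from the schema above.
ALLOWED_TRANSITIONS = {
    "open": {
        "resolved_excluded",
        "resolved_license_obtained",
        "resolved_public_domain_confirmed",
    },
    # Resolved states are terminal in this sketch (assumption).
    "resolved_excluded": set(),
    "resolved_license_obtained": set(),
    "resolved_public_domain_confirmed": set(),
}


def transition(current: str, new: str) -> str:
    """Return the new status if the change is allowed, otherwise raise ValueError."""
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"Illegal claim status change: {current!r} -> {new!r}")
    return new


print(transition("open", "resolved_excluded"))
```

Keeping resolved states terminal means a reopened dispute gets a fresh UCR identifier, which preserves the original record for the retention period.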

The register is not just an internal housekeeping tool. Under the GPAI Code of Practice, it is audit evidence. NCAs can request access to it during compliance reviews.


The GPAI Code of Practice (published in July 2025; the AI Office's enforcement powers for GPAI obligations apply from August 2, 2026) establishes a structured compliance track for copyright obligations. Signatories to the CoP commit to:

  1. Policy publication — a publicly accessible copyright compliance policy (typically a page at /legal/copyright-ai or in the technical documentation summary)
  2. Opt-out scanning — documented use of at minimum one state-of-the-art TDM detection method (robots.txt, tdmrep, IPTC RTI)
  3. Claims register maintenance — quarterly review cycles minimum
  4. Downstream notification — informing deployers who access your model via API about known data restrictions that affect downstream use cases

The CoP is a voluntary route for demonstrating compliance, not a separate legal regime. Under Art.53(4), providers may rely on an approved code of practice to demonstrate compliance with their Art.53 obligations until a harmonised standard is published. If you are a signatory and can demonstrate adherence to the CoP copyright track, the conversation with the EU AI Office shifts from "prove you are compliant with Art.53(1)(c)" to "show us your CoP adherence records," although adherence is not by itself conclusive proof of compliance.
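
The quarterly review cycle is easier to keep if stale claims surface automatically. The sketch below assumes claims are stored as dictionaries with claim_id, claim_status, and an ISO-format claim_detected_date; the 90-day default is an illustrative threshold of roughly one quarter.

```python
from datetime import date, timedelta


def stale_open_claims(claims: list[dict], today: date, max_age_days: int = 90) -> list[str]:
    """Return IDs of claims that are still open and older than max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    stale = []
    for claim in claims:
        detected = date.fromisoformat(claim["claim_detected_date"])
        if claim["claim_status"] == "open" and detected < cutoff:
            stale.append(claim["claim_id"])
    return stale


# Illustrative register contents.
sample_claims = [
    {"claim_id": "UCR-2026-001", "claim_status": "open", "claim_detected_date": "2026-01-10"},
    {"claim_id": "UCR-2026-002", "claim_status": "open", "claim_detected_date": "2026-05-01"},
    {"claim_id": "UCR-2025-014", "claim_status": "resolved_excluded", "claim_detected_date": "2025-12-01"},
]
print(stale_open_claims(sample_claims, today=date(2026, 6, 5)))
```

Passing today explicitly keeps the function deterministic, which makes the review report reproducible when an auditor asks how a given list was produced.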


Python Implementation: GPAICopyrightComplianceTracker

Here is a minimal implementation scaffold that covers the core compliance requirements:

import json
from dataclasses import dataclass, asdict
from datetime import date, datetime, timezone
from pathlib import Path
from typing import Optional
import urllib.request
import urllib.robotparser


@dataclass
class DataSourceRecord:
    """One row of the training data inventory."""
    source_id: str
    source_name: str
    source_url: str
    collection_method: str
    collection_date_start: str
    collection_date_end: str
    volume_gb: float
    license_tier: int          # 1=cleared, 2=TDM-exception, 3=excluded, 4=unresolved
    opt_out_scan_method: str
    opt_out_scan_date: str
    excluded_domains_count: int = 0
    unresolved_claims_count: int = 0
    notes: str = ""


class GPAICopyrightComplianceTracker:
    """JSON-file-backed registry of data sources and unresolved claims."""

    def __init__(self, registry_path: str = "gpai_copyright_registry.json"):
        self.registry_path = Path(registry_path)
        self.data_sources: list[DataSourceRecord] = []
        self.unresolved_claims: list[dict] = []
        self._load()

    def _load(self):
        # Restore a previously saved registry, if one exists.
        if self.registry_path.exists():
            raw = json.loads(self.registry_path.read_text())
            self.data_sources = [DataSourceRecord(**s) for s in raw.get("data_sources", [])]
            self.unresolved_claims = raw.get("unresolved_claims", [])

    def _save(self):
        # Persist the full registry with a timezone-aware UTC timestamp
        # (datetime.utcnow() is deprecated since Python 3.12).
        self.registry_path.write_text(json.dumps({
            "generated": datetime.now(timezone.utc).isoformat(),
            "data_sources": [asdict(s) for s in self.data_sources],
            "unresolved_claims": self.unresolved_claims,
        }, indent=2))

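    def check_robots_txt_tdm(self, domain: str) -> bool:
        """Returns True if robots.txt disallows any of the checked crawler agents.

        Returns False when robots.txt cannot be fetched or parsed, so callers
        should log failures separately rather than read False as "no opt-out".
        """
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(f"https://{domain}/robots.txt")
        try:
            rp.read()
        except Exception:
            # Network, DNS, or decoding failure: opt-out status is unknown.
            return False
        # can_fetch() falls back to the "*" rules when an agent has no own entry.
        return any(
            not rp.can_fetch(agent, f"https://{domain}/")
            for agent in ["GPTBot", "CCBot", "anthropic-ai", "*"]
        )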

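    def check_tdmrep(self, domain: str) -> Optional[list]:
        """Fetches /.well-known/tdmrep.json; returns the parsed JSON or None.

        TDMRep files are expected to hold a JSON array of rule objects.
        """
        url = f"https://{domain}/.well-known/tdmrep.json"
        req = urllib.request.Request(url, headers={"User-Agent": "TDMRepChecker/1.0"})
        try:
            # The context manager closes the connection after reading.
            with urllib.request.urlopen(req, timeout=5) as resp:
                return json.loads(resp.read())
        except Exception:
            # Missing file, network error, or invalid JSON.
            return None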

    def register_data_source(self, record: DataSourceRecord):
        self.data_sources.append(record)
        self._save()

    def register_claim(
        self,
        source_url: str,
        claim_type: str,
        data_volume: str,
        notes: str = ""
    ) -> str:
        # Three-digit sequence number to match the UCR-YYYY-NNN identifier format.
        claim_id = f"UCR-{date.today().year}-{len(self.unresolved_claims)+1:03d}"
        claim = {
            "claim_id": claim_id,
            "source_url": source_url,
            "claim_type": claim_type,
            "claim_detected_date": date.today().isoformat(),
            "claim_status": "open",
            "data_volume_affected": data_volume,
            "notes": notes,
            "resolution_date": None,
            "mitigation_action": "",
        }
        self.unresolved_claims.append(claim)
        self._save()
        return claim_id

    def resolve_claim(self, claim_id: str, resolution: str, action: str):
        """Marks a claim as resolved; raises KeyError for an unknown claim_id."""
        for claim in self.unresolved_claims:
            if claim["claim_id"] == claim_id:
                claim["claim_status"] = resolution
                claim["resolution_date"] = date.today().isoformat()
                claim["mitigation_action"] = action
                self._save()
                return
        raise KeyError(f"Unknown claim_id: {claim_id}")

    def generate_compliance_summary(self) -> dict:
        """Produces figures that feed the Art.53(1)(d) training data summary."""
        total_volume = sum(s.volume_gb for s in self.data_sources)
        tier_breakdown = {}
        for s in self.data_sources:
            tier_breakdown[s.license_tier] = tier_breakdown.get(s.license_tier, 0) + s.volume_gb
        open_claims = sum(1 for c in self.unresolved_claims if c["claim_status"] == "open")
        return {
            "generated_date": date.today().isoformat(),
            "total_training_data_gb": total_volume,
            "source_count": len(self.data_sources),
            "license_tier_breakdown_gb": tier_breakdown,
            "opt_out_exclusions": sum(s.excluded_domains_count for s in self.data_sources),
            "unresolved_claims_open": open_claims,
            "unresolved_claims_total": len(self.unresolved_claims),
            "compliance_note": "Art.53(1)(c) EU AI Act: copyright compliance policy in effect",
        }

One dimension of GPAI copyright compliance that teams often miss: jurisdiction affects enforcement exposure.

If your GPAI model's training infrastructure and model weights reside on US hyperscalers (AWS, Azure, GCP), the EU AI Office's enforcement powers interact with the US CLOUD Act. US authorities can compel access to training data stored on US cloud infrastructure even if that data is nominally "in EU regions."

For GPAI providers who have significant exposure from ambiguous training data sources — Common Crawl, social media, paywalled content with uncertain TDM opt-out status — EU-native infrastructure (Hetzner, Scaleway, OVHcloud, or EU-sovereign platforms like sota.io) means your compliance data and model artefacts sit exclusively under EU jurisdiction. This does not eliminate copyright risk, but it eliminates the cross-jurisdictional exposure layer that makes complex rights disputes harder to resolve.

The GPAI Code of Practice does not mandate EU-native infrastructure, but several AI Office guidance documents note that infrastructure jurisdiction affects the enforceability of data access orders during NCA reviews.


Evidence Checklist for NCA Audit

Before August 2, 2026, your GPAI copyright compliance documentation should include:

GPAI Copyright Compliance — Pre-August 2026 Checklist

Art.53(1)(c) Policy:
[ ] Written copyright compliance policy published at accessible URL
[ ] Policy specifies which TDM opt-out signals are checked (robots.txt, tdmrep, X-Robots-Tag)
[ ] Policy version history maintained (dates, changes)

Training Data Inventory:
[ ] All pre-training and fine-tuning corpora documented
[ ] Source, collection date, volume, and license tier for each corpus
[ ] Opt-out scan methodology documented per corpus
[ ] Scan date and tooling version recorded

TDM Opt-Out Compliance:
[ ] robots.txt exclusions logged with domain list and timestamps
[ ] tdmrep.json scans logged where applicable
[ ] X-Robots-Tag: noai / noimageai headers checked during collection
[ ] Domains excluded due to opt-out: count and representative sample available

Unresolved Claims Register:
[ ] Register exists with UCR identifier schema
[ ] All open claims have assigned legal reviewer
[ ] Claims older than 90 days have resolution notes or escalation record
[ ] Register reviewed on quarterly cadence (last review date documented)

Downstream Notification:
[ ] API documentation discloses known training data restrictions
[ ] Downstream providers notified of any restrictions affecting downstream use cases
[ ] Notification mechanism tested (can produce NCA evidence of delivery)

Art.53(1)(d) Public Summary:
[ ] Public summary of training data categories published
[ ] Summary does not disclose trade secrets but covers data types and sources
[ ] Summary updated with each major training run
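
If you keep this checklist as a plain-text file in your repository, a short script can report progress per section. The sketch assumes items are marked "[x]" once done and that section headings are the lines ending in a colon; both conventions are assumptions of this example.

```python
def checklist_progress(text: str) -> dict[str, tuple[int, int]]:
    """Map each section heading to (completed items, total items)."""
    progress: dict[str, tuple[int, int]] = {}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.endswith(":") and not line.startswith("["):
            # A heading such as "Training Data Inventory:" starts a new section.
            section = line[:-1]
            progress[section] = (0, 0)
        elif line.startswith(("[ ]", "[x]", "[X]")) and section is not None:
            done, total = progress[section]
            progress[section] = (done + (1 if line[1] in "xX" else 0), total + 1)
    return progress


# Illustrative excerpt with two items ticked.
sample = """\
Training Data Inventory:
[x] All pre-training and fine-tuning corpora documented
[ ] Scan date and tooling version recorded

Unresolved Claims Register:
[x] Register exists with UCR identifier schema
"""
for name, (done, total) in checklist_progress(sample).items():
    print(f"{name}: {done}/{total}")
```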

What Comes Next in This Series

Post #4 will cover GPAI Systemic Risk Assessment and Red Teaming — the adversarial testing obligations that apply to GPAI models above the systemic risk threshold. Post #5 is the compliance stack finale, covering the full registry, evidence package, and EU-native infrastructure checklist for August 2, 2026.

If you are building a GPAI model and need to deploy it on EU-sovereign infrastructure with built-in compliance documentation workflows, sota.io provides a European PaaS with zero US hyperscaler data residency exposure. Start with a free deployment today and keep your GPAI compliance jurisdiction clean.