2026-04-16 · 12 min read

EU AI Act GPAI CoP Chapter 2: Copyright & TDM Opt-Out Compliance for GPAI Model Training — Developer Guide (2026)

Every General-Purpose AI model provider placing a GPAI model on the EU market must comply with EU copyright law including the DSM Directive's text and data mining (TDM) framework. EU AI Act Art.52(1)(c) codifies this as an explicit GPAI obligation, and GPAI Code of Practice Chapter 2 operationalises it into concrete audit commitments and documentation requirements.

For GPAI providers, this means having a verifiable, documented process for respecting rights holders' opt-out reservations before and during training data collection — and publishing a transparency summary under Art.52(2) that describes how that process worked.

For SaaS developers integrating GPAI APIs, CoP Chapter 2 determines what copyright compliance verification you can demand from your GPAI provider — and what downstream liability exposure you carry if your provider's training corpus included opt-out-reserved content without authorisation.

This guide covers the complete CoP Chapter 2 compliance picture: the DSM Directive legal framework, the Art.52 mandatory obligations, the CoP Chapter 2 audit commitments, how TDM opt-out signals work in practice, Python TDMOptOutTracker tooling, and a 25-item implementation checklist.


From Litigation Risk to Regulatory Obligation

Copyright compliance for AI training is not a new concern. But the EU AI Act's GPAI chapter makes it a mandatory transparency and audit obligation for the first time, with enforcement attached to non-compliance.

The key shift: before the EU AI Act, copyright compliance for AI training was a matter of civil litigation risk (rights holders suing AI providers for infringement). After Art.52(1)(c) and CoP Chapter 2, it is also a regulatory compliance obligation enforceable by the AI Office under Art.88, with penalties up to 3% of global annual turnover under Art.99(4) for GPAI providers.

The three-layer obligation structure:

Layer                   | Source                  | Obligation
Civil law               | DSM Directive Art.4     | Respect TDM opt-out reservations
Regulatory transparency | EU AI Act Art.52(1)(c)  | Maintain copyright compliance policy
Regulatory transparency | EU AI Act Art.52(2)     | Publish training data summary
CoP commitment          | CoP Chapter 2           | Audit-ready documentation + transparency

DSM Directive TDM Framework: Art.3 vs Art.4

The Directive (EU) 2019/790 on Copyright in the Digital Single Market (DSM Directive) introduced two TDM exceptions that frame all GPAI training data copyright compliance:

Art.3: Research TDM Exception (No Opt-Out)

Art.3 provides a mandatory exception (member states cannot restrict it) for TDM by research organisations and cultural heritage institutions for scientific research purposes, on lawfully accessed content. There is no opt-out right for rights holders under Art.3.

Most commercial GPAI providers cannot rely on Art.3 because they are neither research organisations nor cultural heritage institutions, and training models for commercial deployment is not TDM for the purposes of scientific research within the meaning of Art.3.

Art.4: General TDM Exception (With Opt-Out Right)

Art.4 provides a general TDM exception available to anyone (including commercial entities), for any purpose, on lawfully accessed content. There is no licence required — but rights holders have an explicit opt-out right under Art.4(3):

"The exception or limitation provided for in paragraph 1 shall apply on condition that the use of works and other subject matter referred to in that paragraph has not been expressly reserved by their rightholders in an appropriate manner, such as machine-readable means in the case of content made publicly available online."

Art.4(3) is the critical provision for GPAI training compliance. It means:

  1. GPAI providers can use publicly accessible web content for training without a licence — unless the rights holder has opted out
  2. Rights holders can opt out by expressing a reservation "in an appropriate manner"
  3. For online content, "machine-readable means" is the specified opt-out mechanism
  4. GPAI providers must respect reservations that have been expressed in a machine-readable form
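
The four points above reduce to a per-source decision rule. A minimal sketch, where the function and parameter names are illustrative shorthand, not a legal test:

```python
# Minimal sketch of the Art.4(3) decision logic for a single training source.
# Parameter names are illustrative, not a legal standard.
def may_train_without_licence(lawfully_accessed: bool, optout_reserved: bool) -> bool:
    """Art.4 exception: lawful access is required; an opt-out defeats the exception."""
    if not lawfully_accessed:
        return False             # Art.4 only covers lawfully accessed content
    return not optout_reserved   # exception applies unless rights are reserved


def may_train(licensed: bool, lawfully_accessed: bool, optout_reserved: bool) -> bool:
    """A licence bypasses the exception analysis entirely (see the licensing register below)."""
    return licensed or may_train_without_licence(lawfully_accessed, optout_reserved)
```

The key asymmetry: a licence always authorises use, but the Art.4 exception only applies while no reservation has been expressed.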

Machine-Readable TDM Opt-Out Signals

The DSM Directive does not define the specific technical format for machine-readable opt-out signals. In practice, three formats have emerged as de facto standards:

1. robots.txt Disallow Directives

The most widely used opt-out mechanism is the robots.txt file. While historically used for search engine crawling, robots.txt Disallow directives targeting AI crawlers are now widely treated in EU copyright practice as an "appropriate" machine-readable reservation under DSM Art.4(3).

Critical distinction:

# Disallows web crawling (search indexing)
User-agent: Googlebot
Disallow: /

# Disallows AI training data collection — emerging best practice
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: anthropic-ai
Disallow: /

A rights holder's robots.txt that specifically disallows AI training bots constitutes a clear machine-readable reservation. GPAI providers who scraped content despite bot-specific Disallow directives cannot rely on the Art.4 exception.

A general Disallow: / applying to all crawlers is more ambiguous — courts have not uniformly held that it constitutes a TDM opt-out — but a defensible approach requires treating it as a reservation.
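
A crawl pipeline can distinguish these cases mechanically with the standard library's robots.txt parser. A sketch, assuming an illustrative list of AI-training user agents that would need ongoing maintenance:

```python
from urllib.robotparser import RobotFileParser

# Illustrative list of AI-training crawler user agents; maintain your own.
AI_TRAINING_AGENTS = ["GPTBot", "CCBot", "anthropic-ai"]


def classify_robots_optout(robots_txt: str, url: str) -> str:
    """Classify a robots.txt as wildcard opt-out, bot-specific opt-out, or none."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # A blocked generic (non-AI) agent indicates a site-wide wildcard Disallow
    if not parser.can_fetch("generic-search-crawler", url):
        return "robots_txt_wildcard"      # ambiguous case: treat as reservation
    if any(not parser.can_fetch(agent, url) for agent in AI_TRAINING_AGENTS):
        return "robots_txt_bot_specific"  # clear machine-readable reservation
    return "none"
```

A Googlebot-only Disallow classifies as "none" here: a search-indexing block alone is not an AI-training reservation, which is the critical distinction above. The wildcard case is surfaced separately so the pipeline can apply the defensive treat-as-reservation policy.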

2. ai.txt Protocol

The ai.txt protocol (a robots.txt-style convention dedicated to AI training opt-outs) was developed in 2023–2025 to address the limitation of robots.txt in distinguishing search indexing from AI training. The format:

# ai.txt
User-agent: *
Disallow: /

# Allow specific research crawlers
User-agent: ia_archiver
Allow: /

GPAI providers must treat ai.txt Disallow directives as unambiguous Art.4(3) reservations.

3. HTML Meta Tags and HTTP Headers

Two additional machine-readable signals are recognised:

HTML meta tag:

<meta name="robots" content="noai, noimageai">

HTTP response header:

X-Robots-Tag: noai, noimageai

The noai and noimageai directives signal explicitly that content is not available for AI training. These page-level and asset-level signals are more granular than site-level robots.txt — a rights holder may allow general crawling but opt specific pages or content types out of AI training.
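
Page-level detection of both signals can be sketched with the standard library's HTML parser; the function names here are illustrative:

```python
from html.parser import HTMLParser


def _directives(value: str) -> set[str]:
    """Split a robots-style directive list into normalised tokens."""
    return {token.strip().lower() for token in value.split(",") if token.strip()}


class _NoAIMetaParser(HTMLParser):
    """Detects <meta name="robots" content="noai, ..."> in page markup."""

    def __init__(self) -> None:
        super().__init__()
        self.noai = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and attr.get("name", "").lower() == "robots":
            if _directives(attr.get("content", "")) & {"noai", "noimageai"}:
                self.noai = True


def page_has_noai_signal(html: str, headers: dict[str, str]) -> bool:
    """True if the X-Robots-Tag header or a robots meta tag opts out of AI training."""
    if _directives(headers.get("X-Robots-Tag", "")) & {"noai", "noimageai"}:
        return True
    parser = _NoAIMetaParser()
    parser.feed(html)
    return parser.noai
```

Because the header check runs first, a pipeline can often skip HTML parsing entirely when the response headers already carry the reservation.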

4. Terms of Service Reservations

Art.4(3) permits opt-outs expressed "in an appropriate manner" — not only machine-readable means. However, machine-readable signals are explicitly mentioned for online content, implying they are the required format for publicly accessible content at scale.

Terms of Service opt-out language (common in creative platforms, news publishers, and code hosting services) may supplement machine-readable signals, but GPAI providers relying on large-scale automated crawling must have machine-readable signal detection systems — manually reading ToS for each source at scale is not operationally feasible and is not what Art.4(3) contemplates.


EU AI Act Art.52(1)(c): The Copyright Policy Obligation

Art.52(1)(c) requires all GPAI providers to:

"… put in place a policy to comply with Union copyright law, in particular to identify and comply with, including through state of the art technologies, a reservation of rights expressed pursuant to Article 4(3) of Directive (EU) 2019/790 …"

This is a mandatory obligation, not a best practice. The key elements:

"Put in place a policy" — A documented, implemented process. Not an ad-hoc intention. The policy must be verifiable — AI Office inspectors or national market surveillance authorities can request it.

"State of the art technologies" — GPAI providers must use current technical methods for opt-out detection, not manual sampling. This implies automated crawl-time opt-out signal detection integrated into the data pipeline.

"Identify and comply with … reservation of rights" — Both detection and compliance are required. A policy that detects opt-outs but exceptions them out is not compliant.

Art.52(2): Training Data Transparency Summary

Art.52(2) requires all GPAI providers to:

"… make publicly available a sufficiently detailed summary of the content used for training the GPAI model, according to a template provided by the AI Office."

The transparency summary must cover:

The AI Office has published a template that structures this disclosure. Providers who have completed CoP Chapter 2 commitments should have the underlying data to populate this template.

What "Sufficiently Detailed" Means

The AI Office's transparency summary template (released Q4 2025) requires the copyright section to include:

  1. Which machine-readable opt-out signals were monitored (robots.txt, ai.txt, meta tags)
  2. The technical implementation of opt-out detection in the crawl pipeline
  3. The exclusion rate — what percentage of candidate sources were excluded due to opt-out signals
  4. How retroactive opt-outs (signals added after initial crawl) are handled
  5. Known categories of licensed content (e.g., licensed news archives, Creative Commons datasets)

A transparency summary that says "we respect opt-outs" without specifying the technical implementation is not "sufficiently detailed" under Art.52(2).
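
A hypothetical sketch of how the five items above might be assembled into the copyright section; the field names are assumptions for illustration, not the AI Office's actual schema:

```python
# Illustrative assembly of the five copyright items for the Art.52(2) summary.
# Field names are assumptions about the template, not the official format.
def copyright_section(signals: list[str], detection_impl: str,
                      candidate_sources: int, excluded_sources: int,
                      retroactive_policy: str, licensed_categories: list[str]) -> dict:
    rate = 0.0 if candidate_sources == 0 else round(
        100 * excluded_sources / candidate_sources, 2)
    return {
        "optout_signals_monitored": signals,                # item 1
        "detection_implementation": detection_impl,         # item 2
        "exclusion_rate_percent": rate,                     # item 3
        "retroactive_optout_handling": retroactive_policy,  # item 4
        "licensed_content_categories": licensed_categories, # item 5
    }
```

The exclusion rate (item 3) is the one quantitative field: excluded candidate sources as a percentage of all candidates considered.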


CoP Chapter 2: The Audit Commitments

CoP Chapter 2 (final draft, Q1 2026) translates Art.52(1)(c) and Art.52(2) into specific, audit-ready commitments that GPAI providers sign up to when joining the Code of Practice.

Commitment 2.1: Pre-Crawl Opt-Out Signal Detection

Providers commit to implementing automated, real-time opt-out signal detection during data collection, covering robots.txt (bot-specific and wildcard Disallow), ai.txt, HTML noai/noimageai meta tags, and X-Robots-Tag HTTP headers.

The commitment requires that opt-out detection operates at crawl time (when data is collected), not retrospectively after training data selection. Retroactive opt-out processing (after the crawl) does not satisfy Commitment 2.1 because the rights holder's opt-out was not respected at the point of data collection.

Commitment 2.2: Exclusion Log Maintenance

Providers commit to maintaining a machine-readable exclusion log that records, for each detected opt-out signal:

  1. The domain or URL covered and the exclusion scope (domain, subdomain, path prefix, or single URL)
  2. The signal type and the raw signal content, for the audit trail
  3. Detection, exclusion, and verification timestamps
  4. Whether the signal was retroactive (added after the initial crawl)

The exclusion log must be retained for the duration of model deployment plus 10 years (consistent with Art.18 documentation retention). AI Office audit requests may require access to exclusion logs.
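
Ten-year retention implies durable, machine-readable storage. A minimal append-only JSONL sketch, using a trimmed stand-in for the OptOutExclusionRecord dataclass shown later in this guide:

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ExclusionLogEntry:
    """Trimmed stand-in for an exclusion record; timestamps as ISO-8601 strings."""
    domain_or_url: str
    signal_type: str
    signal_content: str
    detected_at: str


def append_jsonl(path: str, entry: ExclusionLogEntry) -> None:
    """Append-only write keeps the log tamper-evident and machine-readable."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(entry), sort_keys=True) + "\n")
```

One record per line makes the log trivially streamable for an AI Office audit export, and the append-only discipline preserves the original detection timestamps.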

Commitment 2.3: Retroactive Opt-Out Handling Policy

Rights holders add TDM opt-out signals after initial crawls. Providers commit to:

  1. Monitoring for retroactive signals, including bulk notifications from rights holder organisations
  2. Excluding newly reserved sources from all future data collection
  3. Assessing the corpus impact of each retroactive opt-out
  4. Escalating significant cases through a documented triage process

This commitment does not require re-training models every time a retroactive opt-out is received — that would be operationally impossible. It requires a documented triage process that assesses the significance of retroactive opt-outs and responds proportionately.

Commitment 2.4: Licensing Register

For training data that is not covered by the Art.4 exception (because it is not publicly accessible, or because the rights holder has opted out and the provider has negotiated a licence instead), providers commit to maintaining a licensing register recording, for each licensed dataset, the rights holder, licence type and scope, licence period, sublicensing terms, and a reference to the licence document for audit.

Commitment 2.5: Transparency Summary Publication and Update

Providers commit to publishing the Art.52(2) transparency summary on a publicly accessible, machine-readable page (not behind authentication) and updating it for each new model version and whenever the training corpus materially changes.


The SaaS Developer Perspective: What to Verify

If you build a product on a GPAI API — Claude, GPT-4, Gemini, Mistral, or any other — you are downstream of a GPAI provider's copyright compliance. While Art.55 creates downstream developer obligations primarily around use-of-output disclosures, there are three copyright compliance considerations for API consumers:

1. Verify Your Provider's Art.52(2) Summary

Your GPAI provider must publish a transparency summary. Verify that it exists and is publicly accessible, that it names the opt-out signals monitored, and that it describes the technical opt-out detection implementation rather than merely asserting compliance.

A provider without a published Art.52(2) summary as of early 2026 is non-compliant with baseline GPAI obligations. This is relevant for procurement due diligence, especially in regulated sectors (financial services, healthcare, legal tech) where your clients may ask about your AI supply chain's regulatory compliance.

2. Understand Your Downstream Exposure

If your GPAI provider trained on opt-out-reserved content without authorisation, the rights holders' claim is against the provider for the training act. However, your exposure is not zero: clients in regulated sectors may demand supply-chain compliance evidence, and the scope of contractual indemnities from providers varies.

Due diligence means being able to demonstrate your provider has a documented, policy-backed copyright compliance process — not just asserting they comply.

3. Fine-Tuning: When You Become a GPAI Provider

If you fine-tune a GPAI model on proprietary data and release it externally (via API or product), you may qualify as a GPAI provider yourself under Art.3(47). If so, the Art.52(1)(c) copyright policy and Art.52(2) transparency obligations apply to your fine-tuning data collection, including Art.4(3) opt-out compliance for any web-scraped fine-tuning data.

The trigger is external release — internal fine-tuning for internal use only is not GPAI model provision.


Python TDMOptOutTracker: Implementation Tooling

from dataclasses import dataclass
from datetime import datetime, date
from enum import Enum
from typing import Optional
import json


class OptOutSignalType(str, Enum):
    ROBOTS_TXT_BOT_SPECIFIC = "robots_txt_bot_specific"
    ROBOTS_TXT_WILDCARD = "robots_txt_wildcard"
    AI_TXT = "ai_txt"
    HTML_META_NOAI = "html_meta_noai"
    HTTP_HEADER_NOAI = "http_header_noai"
    TOS_RESERVATION = "tos_reservation"
    RETROACTIVE_NOTIFICATION = "retroactive_notification"


class OptOutScope(str, Enum):
    FULL_DOMAIN = "full_domain"
    SUBDOMAIN = "subdomain"
    PATH_PREFIX = "path_prefix"
    SINGLE_URL = "single_url"


@dataclass
class OptOutExclusionRecord:
    """Exclusion log entry per CoP Chapter 2 Commitment 2.2."""
    domain_or_url: str
    signal_type: OptOutSignalType
    signal_content: str          # Raw signal text for audit trail
    exclusion_scope: OptOutScope
    detected_at: datetime
    excluded_at: datetime
    verified_at: Optional[datetime] = None
    retroactive: bool = False    # True if signal added after initial crawl
    notes: str = ""


@dataclass
class LicenceRecord:
    """Licensing register entry per CoP Chapter 2 Commitment 2.4."""
    dataset_name: str
    rights_holder: str
    licence_type: str            # e.g. "TDM-specific non-exclusive"
    licence_scope: list[str]     # e.g. ["training", "fine-tuning"]
    licence_start: date
    licence_expiry: Optional[date]
    sublicensing_permitted: bool
    licence_document_ref: str    # internal document ID for audit


class TDMOptOutTracker:
    """
    Implements GPAI CoP Chapter 2 TDM opt-out tracking per Art.52(1)(c).
    
    Usage:
        tracker = TDMOptOutTracker(model_id="model-v3", provider="MyCompany")
        
        # During crawl pipeline
        if tracker.is_excluded(url):
            skip_url(url)
        else:
            crawl_and_check_signals(url, tracker)
    """
    
    def __init__(self, model_id: str, provider: str):
        self.model_id = model_id
        self.provider = provider
        self.exclusion_log: list[OptOutExclusionRecord] = []
        self.licence_register: list[LicenceRecord] = []
        self._excluded_domains: set[str] = set()
        self._excluded_urls: set[str] = set()
        self._excluded_prefixes: set[str] = set()
    
    def record_exclusion(self, record: OptOutExclusionRecord) -> None:
        """Record an opt-out exclusion. Updates in-memory lookups for real-time checking."""
        self.exclusion_log.append(record)
        if record.exclusion_scope in (OptOutScope.FULL_DOMAIN, OptOutScope.SUBDOMAIN):
            # A subdomain exclusion is itself a full netloc, so both map to a domain entry
            self._excluded_domains.add(self._extract_domain(record.domain_or_url))
        elif record.exclusion_scope == OptOutScope.PATH_PREFIX:
            self._excluded_prefixes.add(record.domain_or_url)
        elif record.exclusion_scope == OptOutScope.SINGLE_URL:
            self._excluded_urls.add(record.domain_or_url)
    
    def is_excluded(self, url: str) -> bool:
        """Real-time check: is this URL covered by an opt-out exclusion?"""
        if url in self._excluded_urls:
            return True
        if any(url.startswith(prefix) for prefix in self._excluded_prefixes):
            return True
        return self._extract_domain(url) in self._excluded_domains
    
    def register_licence(self, record: LicenceRecord) -> None:
        """Register a licensing agreement for non-Art.4 covered content."""
        self.licence_register.append(record)
    
    def handle_retroactive_optout(
        self, 
        domain: str, 
        signal_type: OptOutSignalType,
        signal_content: str,
        notified_at: datetime
    ) -> dict:
        """
        CoP Commitment 2.3: process retroactive opt-out.
        Returns triage assessment for human review.
        """
        record = OptOutExclusionRecord(
            domain_or_url=domain,
            signal_type=signal_type,
            signal_content=signal_content,
            exclusion_scope=OptOutScope.FULL_DOMAIN,
            detected_at=notified_at,
            excluded_at=datetime.now(),
            retroactive=True,
        )
        self.record_exclusion(record)
        
        # Triage: estimate corpus impact (placeholder — integrate with corpus index)
        return {
            "domain": domain,
            "signal_type": signal_type.value,
            "retroactive": True,
            "action_taken": "domain_excluded_from_future_collection",
            "corpus_impact_assessment": "requires_corpus_index_query",
            "recommendation": "query corpus index for domain presence; if >1% of training tokens, escalate to legal",
            "recorded_at": datetime.now().isoformat(),
        }
    
    def generate_art52_2_summary(self) -> dict:
        """
        Generate Art.52(2) training data transparency summary — copyright section.
        """
        exclusions_by_type: dict[str, int] = {}
        for record in self.exclusion_log:
            key = record.signal_type.value
            exclusions_by_type[key] = exclusions_by_type.get(key, 0) + 1
        
        retroactive_count = sum(1 for r in self.exclusion_log if r.retroactive)
        
        return {
            "model_id": self.model_id,
            "provider": self.provider,
            "generated_at": datetime.now().isoformat(),
            "copyright_compliance": {
                "framework": "DSM Directive (EU) 2019/790 Art.4(3)",
                "eu_ai_act_obligation": "Art.52(1)(c)",
                "cop_chapter": "Chapter 2",
                "opt_out_signals_monitored": [
                    "robots.txt (bot-specific and wildcard Disallow)",
                    "ai.txt",
                    "HTML meta noai/noimageai",
                    "HTTP X-Robots-Tag noai",
                    "Rights holder retroactive notifications",
                ],
                "total_exclusions": len(self.exclusion_log),
                "exclusions_by_signal_type": exclusions_by_type,
                "retroactive_optout_count": retroactive_count,
                "licensed_datasets": len(self.licence_register),
                "licence_register_available_on_request": True,
            }
        }
    
    def assess_compliance_readiness(self) -> dict:
        """
        Check whether CoP Chapter 2 commitments are documentably met.
        """
        issues = []
        
        if len(self.exclusion_log) == 0:
            issues.append("CRITICAL: No exclusion log entries — opt-out detection pipeline not confirmed active")
        
        has_robots_txt = any(
            r.signal_type in (OptOutSignalType.ROBOTS_TXT_BOT_SPECIFIC, OptOutSignalType.ROBOTS_TXT_WILDCARD)
            for r in self.exclusion_log
        )
        if not has_robots_txt:
            issues.append("WARNING: No robots.txt opt-out detections logged — confirm crawler parses robots.txt")
        
        unverified = [r for r in self.exclusion_log if r.verified_at is None]
        if len(unverified) > 0:
            issues.append(f"INFO: {len(unverified)} exclusion records lack verification timestamp (CoP Commitment 2.2)")
        
        return {
            "compliance_ready": len(issues) == 0,
            "issues": issues,
            "exclusion_log_count": len(self.exclusion_log),
            "licence_register_count": len(self.licence_register),
            "art52_2_summary_generatable": True,
        }
    
    @staticmethod
    def _extract_domain(url: str) -> str:
        from urllib.parse import urlparse
        parsed = urlparse(url)
        return parsed.netloc or url


# Example: pre-training copyright compliance check
def demo_training_pipeline_integration():
    tracker = TDMOptOutTracker(model_id="foundation-v1", provider="ExampleAI")
    
    # Simulate: robots.txt opt-out detected during crawl
    tracker.record_exclusion(OptOutExclusionRecord(
        domain_or_url="https://news-publisher.eu",
        signal_type=OptOutSignalType.ROBOTS_TXT_BOT_SPECIFIC,
        signal_content="User-agent: CCBot\nDisallow: /",
        exclusion_scope=OptOutScope.FULL_DOMAIN,
        detected_at=datetime(2025, 3, 15, 10, 0),
        excluded_at=datetime(2025, 3, 15, 10, 0),
        verified_at=datetime(2025, 3, 15, 10, 1),
    ))
    
    # Simulate: retroactive opt-out from publishers association
    result = tracker.handle_retroactive_optout(
        domain="creative-archive.org",
        signal_type=OptOutSignalType.RETROACTIVE_NOTIFICATION,
        signal_content="Formal TDM opt-out notification from European Publishers Council",
        notified_at=datetime(2025, 9, 20, 14, 30),
    )
    
    # Simulate: licensed dataset
    tracker.register_licence(LicenceRecord(
        dataset_name="Scientific Literature Archive Q1 2025",
        rights_holder="EuropeanResearchPublishers",
        licence_type="TDM non-exclusive perpetual",
        licence_scope=["training", "fine-tuning"],
        licence_start=date(2025, 1, 1),
        licence_expiry=None,
        sublicensing_permitted=False,
        licence_document_ref="LIC-2025-003",
    ))
    
    # Check compliance and generate Art.52(2) summary
    readiness = tracker.assess_compliance_readiness()
    summary = tracker.generate_art52_2_summary()
    
    print(json.dumps(result, indent=2))     # retroactive opt-out triage assessment
    print(json.dumps(readiness, indent=2))
    print(json.dumps(summary, indent=2))


if __name__ == "__main__":
    demo_training_pipeline_integration()

The TDMOptOutTracker implements the three audit-critical CoP Chapter 2 functions: real-time exclusion checking during crawl, retroactive opt-out triage, and Art.52(2) summary generation. In a production training pipeline, the exclusion log would be backed by a database, and is_excluded() would query a distributed cache updated by the crawl fleet.


Enforcement: What Non-Compliance Looks Like

Three enforcement paths exist for GPAI Art.52(1)(c) copyright non-compliance:

AI Office enforcement (Art.88): The AI Office can request documentation of a GPAI provider's copyright compliance policy under Art.52(1)(c). A provider that cannot produce a documented, implemented opt-out detection policy — or whose Art.52(2) transparency summary is absent or insufficient — faces proceedings under Art.99(4): up to 3% of global annual turnover.

Member state market surveillance (Art.74): National authorities enforce Art.52 compliance for providers active in their jurisdiction. A rights holder who suspects their opt-out was violated can file a complaint triggering a market surveillance investigation.

Civil copyright litigation: Separate from EU AI Act enforcement, rights holders retain their DSM Directive civil claims. Art.4(3) opt-out violation means the provider cannot rely on the Art.4 TDM exception — the training act is copyright infringement. The EU AI Act's transparency requirements (Art.52(2)) may make it easier for rights holders to identify infringement, because providers must now disclose training corpus categories.

The interaction between AI Office enforcement and civil litigation creates a compliance multiplier: a provider who cannot demonstrate opt-out compliance to the AI Office is also likely to face civil claims from rights holders.


EU Jurisdiction and Infrastructure: Why Hosting Matters

For GPAI providers that train models on EU infrastructure, the copyright analysis is straightforward: DSM Directive Art.4 applies.

For providers that train on non-EU infrastructure (US data centres, for example) but place models on the EU market, the question of whether DSM Directive Art.4 applies to the training act has been contested. The AI Office's interpretive guidance (published Q4 2025) clarifies:

Art.52(1)(c) applies based on market placement, not training location. A GPAI model placed on the EU market must have a copyright compliance policy that complies with "Union copyright law" — the DSM Directive — regardless of where the training occurred.

This means GPAI providers with US-based training infrastructure must either:

  1. Retroactively verify that their training data collection respected DSM Art.4(3) opt-outs
  2. Document the impossibility of retroactive verification and implement prospective compliance for future model versions
  3. Obtain retroactive licensing for high-significance rights-reserved content in their corpus

For providers using EU sovereign infrastructure (data centres subject to EU law, with EU data processing chains), the copyright compliance documentation is verifiably governed by EU law from the start. Cloud Act exposure does not affect the training data record — US government access requests cannot reach training data documentation stored in EU-jurisdiction infrastructure.


Implementation Checklist

Pre-Training Data Collection

  1. Crawler parses robots.txt, including bot-specific AI-training Disallow directives
  2. ai.txt files are fetched and treated as Art.4(3) reservations
  3. HTML noai/noimageai meta tags are detected at page level
  4. X-Robots-Tag: noai HTTP headers are detected per response
  5. Opt-out checks run at crawl time, before any content is collected

Exclusion Log (CoP Commitment 2.2)

  6. Every detected signal produces a machine-readable exclusion log entry
  7. Raw signal content is stored for the audit trail
  8. Detection, exclusion, and verification timestamps are recorded
  9. Exclusion scope (domain, subdomain, path prefix, single URL) is captured
  10. Log is retained for the deployment duration plus 10 years

Retroactive Opt-Out Handling (CoP Commitment 2.3)

  11. A channel exists for receiving rights holder notifications
  12. Bulk notifications from rights holder organisations are monitored
  13. Newly reserved sources are excluded from all future collection
  14. Corpus impact is assessed for each retroactive opt-out
  15. A documented triage process defines the legal escalation threshold

Licensing Register (CoP Commitment 2.4)

  16. Register covers all training content not relying on the Art.4 exception
  17. Rights holder, licence type, and licence scope are recorded per dataset
  18. Licence start and expiry dates are tracked
  19. Sublicensing permissions are documented
  20. Each entry references the licence document by internal audit ID

Art.52(2) Transparency Summary (CoP Commitment 2.5)

  21. Summary is published on a publicly accessible page, not behind authentication
  22. Monitored opt-out signals are named explicitly
  23. The technical detection implementation is described
  24. Exclusion rate and retroactive opt-out handling are disclosed
  25. Summary is updated for each new model version and material corpus change


What "State of the Art Technologies" Means in Practice

Art.52(1)(c) specifically requires opt-out detection using "state of the art technologies." The AI Office's interpretive guidance identifies four implementation requirements for state of the art compliance:

Real-time detection: Opt-out signals must be checked at crawl time, not after content is collected. Post-hoc filtering of already-collected content does not satisfy Art.52(1)(c) because the opt-out was not respected at the point of collection.

Signal coverage breadth: Detection must cover all established machine-readable formats (robots.txt, ai.txt, meta tags, HTTP headers). A system that only checks robots.txt while ignoring ai.txt and meta tags is not state of the art as of 2026.

Update cadence: As new opt-out signal formats emerge, the detection system must be updated. The AI Office expects GPAI providers to monitor the technical evolution of TDM opt-out standards and update their detection systems within a reasonable period of new format adoption.

Retroactive signal monitoring: State of the art includes monitoring for retroactive signals from rights holder organisations (publishers' associations, creator groups) who issue bulk opt-out notifications on behalf of members.

A paper policy that says "we check robots.txt" without verifiable technical implementation does not meet the "state of the art" standard.
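
Signal coverage breadth, at least, can be self-audited mechanically. A trivial sketch; the required set reflects the formats discussed in this guide, not an official list:

```python
# Formats named in the discussion above; illustrative, not an official schema.
REQUIRED_SIGNAL_FORMATS = {"robots_txt", "ai_txt", "html_meta_noai", "http_header_noai"}


def coverage_gaps(implemented_formats: set[str]) -> set[str]:
    """Return the opt-out formats the detection pipeline does not yet cover."""
    return REQUIRED_SIGNAL_FORMATS - implemented_formats
```

A non-empty result is the robots.txt-only scenario called out above: the pipeline runs, but it is not state of the art.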


Connection to Art.18 Documentation Retention

Art.52(1)(c) copyright compliance documentation, the exclusion log, and the licensing register must be retained consistent with Art.18's general GPAI documentation retention requirement. For GPAI models, Art.18 requires documentation retention for 10 years after market placement — not 10 years after the training run.

This means exclusion logs, the licensing register, and the documented opt-out policy must remain retrievable and auditable for a decade after market placement, surviving infrastructure migrations, team turnover, and the retirement of individual model versions.

Cloud Act risk applies here: copyright compliance documentation stored on US-jurisdiction infrastructure is subject to US government access requests under the CLOUD Act. For EU-jurisdiction GPAI providers (or providers asserting EU law compliance), training data documentation stored in EU sovereign infrastructure eliminates this exposure.


Summary

GPAI CoP Chapter 2 transforms EU AI Act Art.52(1)(c)'s copyright compliance obligation from an aspiration into an audit-ready engineering requirement:

  1. Real-time opt-out detection during training data collection — robots.txt, ai.txt, meta tags, HTTP headers
  2. Exclusion log maintained for each detected signal, with verification timestamps
  3. Retroactive opt-out handling process with corpus impact triage
  4. Licensing register for non-Art.4-covered content
  5. Art.52(2) transparency summary published publicly and updated regularly

For SaaS developers integrating GPAI APIs: verify your provider's transparency summary before procurement, especially in regulated sectors. For fine-tuners releasing externally: your web-scraped fine-tuning data is subject to the same Art.4(3) obligations.

The enforcement multiplier — AI Office proceedings plus civil copyright claims — makes Art.52(1)(c) non-compliance a material legal risk, not just a box-ticking exercise.

EU sovereign infrastructure note: Training data and compliance documentation hosted on EU-jurisdiction infrastructure (without Cloud Act exposure) creates a clean audit trail that can be disclosed to EU regulators without cross-border data transfer complications. See how sota.io supports EU-compliant AI infrastructure →