EU AI Act GPAI CoP Chapter 2: Copyright & TDM Opt-Out Compliance for GPAI Model Training — Developer Guide (2026)
Every General-Purpose AI model provider placing a GPAI model on the EU market must comply with EU copyright law including the DSM Directive's text and data mining (TDM) framework. EU AI Act Art.52(1)(c) codifies this as an explicit GPAI obligation, and GPAI Code of Practice Chapter 2 operationalises it into concrete audit commitments and documentation requirements.
For GPAI providers, this means having a verifiable, documented process for respecting rights holders' opt-out reservations before and during training data collection — and publishing a transparency summary under Art.52(2) that describes how that process worked.
For SaaS developers integrating GPAI APIs, CoP Chapter 2 determines what copyright compliance verification you can demand from your GPAI provider — and what downstream liability exposure you carry if your provider's training corpus included opt-out-reserved content without authorisation.
This guide covers the complete CoP Chapter 2 compliance picture: the DSM Directive legal framework, the Art.52 mandatory obligations, the CoP Chapter 2 audit commitments, how TDM opt-out signals work in practice, Python TDMOptOutTracker tooling, and a 25-item implementation checklist.
Why Copyright Compliance Is a GPAI-Specific Obligation
Copyright compliance for AI training is not a new concern — but the EU AI Act's GPAI chapter makes it a mandatory transparency and audit obligation for the first time, with enforcement attached to non-compliance.
The key shift: before the EU AI Act, copyright compliance for AI training was a matter of civil litigation risk (rights holders suing AI providers for infringement). After Art.52(1)(c) and CoP Chapter 2, it is also a regulatory compliance obligation enforceable by the AI Office under Art.88, with penalties up to 3% of global annual turnover under Art.99(4) for GPAI providers.
The three-layer obligation structure:
| Layer | Source | Obligation |
|---|---|---|
| Civil law | DSM Directive Art.4 | Respect TDM opt-out reservations |
| Regulatory transparency | EU AI Act Art.52(1)(c) | Maintain copyright compliance policy |
| Regulatory transparency | EU AI Act Art.52(2) | Publish training data summary |
| CoP commitment | CoP Chapter 2 | Audit-ready documentation + transparency |
DSM Directive TDM Framework: Art.3 vs Art.4
The Directive (EU) 2019/790 on Copyright in the Digital Single Market (DSM Directive) introduced two TDM exceptions that frame all GPAI training data copyright compliance:
Art.3: Research TDM Exception (No Opt-Out)
Art.3 provides a mandatory exception (member states cannot restrict it) for TDM by research organisations and cultural heritage institutions for scientific research purposes, on lawfully accessed content. There is no opt-out right for rights holders under Art.3.
Most commercial GPAI providers cannot rely on Art.3 because:
- They are not non-profit research organisations
- Their training objectives include commercial deployment
- Art.3's benefits cannot be transferred to commercial collaborators
Art.4: General TDM Exception (With Opt-Out Right)
Art.4 provides a general TDM exception available to anyone (including commercial entities), for any purpose, on lawfully accessed content. There is no licence required — but rights holders have an explicit opt-out right under Art.4(3):
"The exception or limitation provided for in paragraph 1 shall apply on condition that the use of works and other subject matter referred to in that paragraph has not been expressly reserved by their rightholders in an appropriate manner, such as machine-readable means in the case of content made publicly available online."
Art.4(3) is the critical provision for GPAI training compliance. It means:
- GPAI providers can use publicly accessible web content for training without a licence — unless the rights holder has opted out
- Rights holders can opt out by expressing a reservation "in an appropriate manner"
- For online content, "machine-readable means" is the specified opt-out mechanism
- GPAI providers must respect reservations that have been expressed in a machine-readable form
Machine-Readable TDM Opt-Out Signals
The DSM Directive does not define the specific technical format for machine-readable opt-out signals. In practice, three formats have emerged as de facto standards:
1. robots.txt Disallow Directives
The most widely used opt-out mechanism is the robots.txt file. While historically used for search engine crawling, the EU copyright community has established that robots.txt Disallow directives constitute an "appropriate" machine-readable reservation under DSM Art.4(3).
Critical distinction:
```
# Disallows web crawling (search indexing)
User-agent: Googlebot
Disallow: /

# Disallows AI training data collection — emerging best practice
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /
```
A rights holder's robots.txt that specifically disallows AI training bots constitutes a clear machine-readable reservation. GPAI providers who scraped content despite bot-specific Disallow directives cannot rely on the Art.4 exception.
A general Disallow: / applying to all crawlers is more ambiguous — courts have not uniformly held that it constitutes a TDM opt-out — but a defensible approach requires treating it as a reservation.
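The bot-specific check can be done with the standard library alone. A minimal sketch, assuming an illustrative (not exhaustive) list of AI training user agents:

```python
# Sketch: bot-specific robots.txt opt-out detection with the standard
# library. The AI_TRAINING_AGENTS list is illustrative, not exhaustive.
from urllib.robotparser import RobotFileParser

AI_TRAINING_AGENTS = ["GPTBot", "CCBot", "anthropic-ai", "Google-Extended"]

def robots_txt_reserves_tdm(robots_txt: str, url: str) -> bool:
    """True if any known AI-training user agent is disallowed for this URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return any(not parser.can_fetch(agent, url) for agent in AI_TRAINING_AGENTS)

robots = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nAllow: /\n"
print(robots_txt_reserves_tdm(robots, "https://news-publisher.eu/article"))  # True
```

Note that this example treats only bot-specific directives as reservations; a defensible production pipeline would also flag the wildcard `Disallow: /` case discussed above.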
2. ai.txt Protocol
The ai.txt standard (an extension of robots.txt dedicated to AI training opt-outs) emerged in 2023–2025 to address robots.txt's inability to distinguish search indexing from AI training. The format:
```
# ai.txt
User-agent: *
Disallow: /

# Allow specific research crawlers
User-agent: ia_archiver
Allow: /
```
GPAI providers must treat ai.txt Disallow directives as unambiguous Art.4(3) reservations.
3. HTML Meta Tags and HTTP Headers
Two additional machine-readable signals are recognised:
HTML meta tag:

```html
<meta name="robots" content="noai, noimageai">
```

HTTP response header:

```
X-Robots-Tag: noai, noimageai
```
The noai and noimageai directives signal explicitly that content is not available for AI training. These page-level and asset-level signals are more granular than site-level robots.txt — a rights holder may allow general crawling but opt specific pages or content types out of AI training.
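A minimal page-level detector for these two signals, stdlib only; the noai/noimageai token set follows the emerging convention described above, not a formal standard:

```python
# Sketch: page-level noai detection from HTML meta robots tags and the
# X-Robots-Tag response header, stdlib only. The noai/noimageai token
# names follow the emerging convention described above.
from html.parser import HTMLParser

NOAI_DIRECTIVES = {"noai", "noimageai"}

class RobotsMetaScanner(HTMLParser):
    """Flags any meta robots tag carrying a noai directive."""

    def __init__(self) -> None:
        super().__init__()
        self.noai = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and attr.get("name", "").lower() == "robots":
            tokens = {t.strip().lower() for t in attr.get("content", "").split(",")}
            if tokens & NOAI_DIRECTIVES:
                self.noai = True

def page_opts_out(html: str, headers: dict[str, str]) -> bool:
    """True if either the X-Robots-Tag header or a meta tag signals noai."""
    header_tokens = {t.strip().lower() for t in headers.get("X-Robots-Tag", "").split(",")}
    if header_tokens & NOAI_DIRECTIVES:
        return True
    scanner = RobotsMetaScanner()
    scanner.feed(html)
    return scanner.noai
```

In a real crawler these checks would run on the fetched response before the content is persisted, alongside the site-level robots.txt and ai.txt checks.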
4. Terms of Service Reservations
Art.4(3) permits opt-outs expressed "in an appropriate manner" — not only machine-readable means. However, machine-readable signals are explicitly mentioned for online content, implying they are the required format for publicly accessible content at scale.
Terms of Service opt-out language (common in creative platforms, news publishers, and code hosting services) may supplement machine-readable signals, but GPAI providers relying on large-scale automated crawling must have machine-readable signal detection systems — manually reading ToS for each source at scale is not operationally feasible and is not what Art.4(3) contemplates.
EU AI Act Art.52: The GPAI Copyright Obligations
Art.52(1)(c): Copyright Compliance Policy
Art.52(1)(c) requires all GPAI providers to:
"… put in place a policy to comply with Union copyright law, in particular to identify and comply with, including through state of the art technologies, a reservation of rights expressed pursuant to Article 4(3) of Directive (EU) 2019/790 …"
This is a mandatory obligation, not a best practice. The key elements:
"Put in place a policy" — A documented, implemented process. Not an ad-hoc intention. The policy must be verifiable — AI Office inspectors or national market surveillance authorities can request it.
"State of the art technologies" — GPAI providers must use current technical methods for opt-out detection, not manual sampling. This implies automated crawl-time opt-out signal detection integrated into the data pipeline.
"Identify and comply with … reservation of rights" — Both detection and compliance are required. A policy that detects opt-outs but fails to act on them is not compliant.
Art.52(2): Training Data Transparency Summary
Art.52(2) requires all GPAI providers to:
"… make publicly available a sufficiently detailed summary of the content used for training the GPAI model, according to a template provided by the AI Office."
The transparency summary must cover:
- Categories of training data (web crawl, licensed datasets, synthetic data, etc.)
- Geographic sources where known
- Date ranges of training data collection
- Copyright compliance approach — how TDM opt-outs were handled
- Known licensing agreements for proprietary datasets
The AI Office has published a template that structures this disclosure. Providers who have completed CoP Chapter 2 commitments should have the underlying data to populate this template.
What "Sufficiently Detailed" Means
The AI Office's transparency summary template (released Q4 2025) requires the copyright section to include:
- Which machine-readable opt-out signals were monitored (robots.txt, ai.txt, meta tags)
- The technical implementation of opt-out detection in the crawl pipeline
- The exclusion rate — what percentage of candidate sources were excluded due to opt-out signals
- How retroactive opt-outs (signals added after initial crawl) are handled
- Known categories of licensed content (e.g., licensed news archives, Creative Commons datasets)
A transparency summary that says "we respect opt-outs" without specifying the technical implementation is not "sufficiently detailed" under Art.52(2).
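As a sketch of what "sufficiently detailed" looks like in data terms, the five template fields above can be assembled from crawl pipeline counters. The field names here are illustrative assumptions, not the AI Office template's official keys:

```python
# Sketch: assembling the copyright section of the Art.52(2) summary from
# crawl pipeline counters. Field names are illustrative assumptions, not
# the AI Office template's official keys.
def copyright_section(
    signals_monitored: list[str],
    detection_implementation: str,
    candidate_sources: int,
    excluded_sources: int,
    retroactive_policy: str,
    licensed_categories: list[str],
) -> dict:
    """Covers the five disclosure fields: signals, implementation,
    exclusion rate, retroactive handling, licensed content."""
    rate = 100 * excluded_sources / candidate_sources if candidate_sources else 0.0
    return {
        "opt_out_signals_monitored": signals_monitored,
        "detection_implementation": detection_implementation,
        "exclusion_rate_percent": round(rate, 2),
        "retroactive_optout_handling": retroactive_policy,
        "licensed_content_categories": licensed_categories,
    }

section = copyright_section(
    ["robots.txt", "ai.txt", "meta noai"],
    "crawl-time parser integrated into fetch scheduler",
    candidate_sources=1_200_000,
    excluded_sources=87_400,
    retroactive_policy="triage with corpus impact assessment",
    licensed_categories=["licensed news archives"],
)
print(section["exclusion_rate_percent"])  # 7.28
```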
GPAI Code of Practice Chapter 2: Copyright Commitments
CoP Chapter 2 (final draft, Q1 2026) translates Art.52(1)(c) and Art.52(2) into specific, audit-ready commitments that GPAI providers sign up to when joining the Code of Practice.
Commitment 2.1: Pre-Crawl Opt-Out Signal Detection
Providers commit to implementing automated, real-time opt-out signal detection during data collection:
- robots.txt parsing for AI-specific `User-agent` directives before each crawl request
- ai.txt parsing at domain level before crawling any content from that domain
- HTML meta tag scanning in the response parser pipeline
- HTTP header inspection at request time
The commitment requires that opt-out detection operates at crawl time (when data is collected), not retrospectively after training data selection. Retroactive opt-out processing (after the crawl) does not satisfy Commitment 2.1 because the rights holder's opt-out was not respected at the point of data collection.
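One way to structure the crawl-time ordering the commitment requires: run the cheapest domain-level checks first and short-circuit before any content request is issued. The predicates below are hypothetical stand-ins for real detectors:

```python
# Sketch: ordering Commitment 2.1 checks so domain-level signals
# (robots.txt, ai.txt) short-circuit before any per-page request is made.
# The predicates below are hypothetical stand-ins for real detectors.
from typing import Callable, Optional

def first_optout_signal(
    url: str,
    checks: list[tuple[str, Callable[[str], bool]]],
) -> Optional[str]:
    """Return the first opt-out signal that fires, or None if collectable."""
    for signal_name, detected in checks:
        if detected(url):
            return signal_name  # stop before fetching or persisting content
    return None

# Cheapest, broadest signals first; page-level checks last.
checks = [
    ("robots_txt_bot_specific", lambda u: "news-publisher.eu" in u),  # stand-in
    ("ai_txt", lambda u: False),                                      # stand-in
]
print(first_optout_signal("https://news-publisher.eu/article", checks))  # robots_txt_bot_specific
```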
Commitment 2.2: Exclusion Log Maintenance
Providers commit to maintaining a machine-readable exclusion log that records:
- URL or domain excluded
- Opt-out signal detected (type and content)
- Date of opt-out detection
- Exclusion scope (single URL vs full domain)
- Verification timestamp
The exclusion log must be retained for the duration of model deployment plus 10 years (consistent with Art.18 documentation retention). AI Office audit requests may require access to exclusion logs.
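A durable exclusion log can be as simple as a single SQLite table capturing the fields above; the schema and column names here are assumptions for illustration:

```python
# Sketch: a durable exclusion log backed by SQLite, recording the
# Commitment 2.2 fields. Schema and column names are assumptions.
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS exclusion_log (
    id INTEGER PRIMARY KEY,
    domain_or_url TEXT NOT NULL,
    signal_type TEXT NOT NULL,
    signal_content TEXT NOT NULL,
    exclusion_scope TEXT NOT NULL,
    detected_at TEXT NOT NULL,
    verified_at TEXT
)
"""

def log_exclusion(conn: sqlite3.Connection, **fields) -> None:
    cols = ", ".join(fields)
    marks = ", ".join("?" for _ in fields)
    conn.execute(f"INSERT INTO exclusion_log ({cols}) VALUES ({marks})",
                 tuple(fields.values()))
    conn.commit()

conn = sqlite3.connect(":memory:")  # production would use a server-backed store
conn.execute(SCHEMA)
log_exclusion(conn,
              domain_or_url="https://news-publisher.eu",
              signal_type="robots_txt_bot_specific",
              signal_content="User-agent: CCBot\nDisallow: /",
              exclusion_scope="full_domain",
              detected_at="2025-03-15T10:00:00",
              verified_at="2025-03-15T10:01:00")
```

Because records must survive for deployment plus 10 years, the production store needs append-only semantics and backups, not just a local file.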
Commitment 2.3: Retroactive Opt-Out Handling Policy
Rights holders add TDM opt-out signals after initial crawls. Providers commit to:
- Monitoring published opt-out signals from known rights holder groups (publishers associations, creator organisations)
- Re-evaluating training corpus composition when retroactive opt-out signals are received from high-volume rights holders
- Documenting the retroactive opt-out response process and outcome
This commitment does not require re-training models every time a retroactive opt-out is received — that would be operationally impossible. It requires a documented triage process that assesses the significance of retroactive opt-outs and responds proportionately.
Commitment 2.4: Licensing Register
For training data that is not covered by the Art.4 exception (because it is not publicly accessible, or because the rights holder has opted out and the provider has negotiated a licence instead), providers commit to maintaining a licensing register:
- Licensed dataset name and rights holder
- Licence type (exclusive, non-exclusive, TDM-specific)
- Licence scope (training, fine-tuning, distribution of derivatives)
- Licence duration and renewal status
- Sublicensing rights (relevant for providers who make models available to downstream developers)
Commitment 2.5: Transparency Summary Publication and Update
Providers commit to publishing the Art.52(2) transparency summary on a publicly accessible, machine-readable page (not behind authentication) and updating it:
- Within 30 days of a new model release
- Annually even if no new model is released
- Within 90 days of a material change to the training data composition methodology
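The three triggers collapse into a single "next update due" calculation; this is pure deadline arithmetic per the bullets above:

```python
# Sketch: collapsing the three Commitment 2.5 triggers into one
# "next update due" date. Pure deadline arithmetic per the bullets above.
from datetime import date, timedelta
from typing import Optional

def next_update_due(
    last_published: date,
    model_released: Optional[date] = None,
    methodology_changed: Optional[date] = None,
) -> date:
    """Earliest of: annual refresh, release + 30 days, methodology change + 90 days."""
    deadlines = [last_published + timedelta(days=365)]  # annual fallback
    if model_released:
        deadlines.append(model_released + timedelta(days=30))
    if methodology_changed:
        deadlines.append(methodology_changed + timedelta(days=90))
    return min(deadlines)

print(next_update_due(date(2026, 1, 1), model_released=date(2026, 3, 1)))  # 2026-03-31
```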
The SaaS Developer Perspective: What to Verify
If you build a product on a GPAI API — Claude, GPT-4, Gemini, Mistral, or any other — you are downstream of a GPAI provider's copyright compliance. While Art.55 creates downstream developer obligations primarily around use-of-output disclosures, there are three copyright compliance considerations for API consumers:
1. Verify Your Provider's Art.52(2) Summary
Your GPAI provider must publish a transparency summary. Verify:
- Is it published at a machine-readable URL?
- Does it include a copyright compliance section?
- Does it describe the TDM opt-out detection mechanism?
- Is it dated after 2 August 2025 (the date the GPAI obligations began to apply)?
A provider without a published Art.52(2) summary as of early 2026 is non-compliant with baseline GPAI obligations. This is relevant for procurement due diligence, especially in regulated sectors (financial services, healthcare, legal tech) where your clients may ask about your AI supply chain's regulatory compliance.
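Part of this check can be automated once the summary JSON is fetched and parsed. This sketch validates a summary dict against a minimal set of expected keys; the key names assume a summary shaped like the `TDMOptOutTracker` output later in this guide, whereas real summaries will follow the AI Office template:

```python
# Sketch: automating part of the procurement due-diligence check against a
# provider's parsed transparency summary. Expected keys are assumptions
# based on the TDMOptOutTracker output shape, not the official template.
REQUIRED_COPYRIGHT_KEYS = {"opt_out_signals_monitored", "total_exclusions"}
GPAI_OBLIGATIONS_APPLY_FROM = "2025-08-02"

def due_diligence_findings(summary: dict) -> list[str]:
    """Return a list of findings; an empty list means the summary passes."""
    findings = []
    section = summary.get("copyright_compliance")
    if not section:
        findings.append("missing copyright compliance section")
    elif REQUIRED_COPYRIGHT_KEYS - section.keys():
        findings.append("copyright section lacks signal/exclusion detail")
    if summary.get("generated_at", "")[:10] <= GPAI_OBLIGATIONS_APPLY_FROM:
        findings.append("summary not dated after 2 August 2025")
    return findings
```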
2. Downstream Copyright Risk from Provider Non-Compliance
If your GPAI provider trained on opt-out-reserved content without authorisation, the rights holders' claim is against the provider for the training act. However:
- Some rights holders' legal theories extend infringement claims to outputs that reproduce protected expression from infringing training data
- Courts in the US, UK, and EU member states have not resolved whether downstream API users share liability — but the uncertainty is real
- Enterprise contracts in creative industries (publishing, music, visual arts) often require copyright-clean AI supply chains
Due diligence means being able to demonstrate your provider has a documented, policy-backed copyright compliance process — not just asserting they comply.
3. Fine-Tuning: When You Become a GPAI Provider
If you fine-tune a GPAI model on proprietary data and release it externally (via API or product), you may qualify as a GPAI provider yourself under Art.3(47). If so:
- Art.52(1)(c) applies to your fine-tuning corpus
- If your fine-tuning data includes web-scraped content, you need your own TDM opt-out process
- Art.52(2) applies to your transparency summary for the fine-tuned model
The trigger is external release — internal fine-tuning for internal use only is not GPAI model provision.
Python TDMOptOutTracker: Implementation Tooling
```python
from dataclasses import dataclass
from datetime import datetime, date
from enum import Enum
from typing import Optional
from urllib.parse import urlparse
import json


class OptOutSignalType(str, Enum):
    ROBOTS_TXT_BOT_SPECIFIC = "robots_txt_bot_specific"
    ROBOTS_TXT_WILDCARD = "robots_txt_wildcard"
    AI_TXT = "ai_txt"
    HTML_META_NOAI = "html_meta_noai"
    HTTP_HEADER_NOAI = "http_header_noai"
    TOS_RESERVATION = "tos_reservation"
    RETROACTIVE_NOTIFICATION = "retroactive_notification"


class OptOutScope(str, Enum):
    FULL_DOMAIN = "full_domain"
    SUBDOMAIN = "subdomain"
    PATH_PREFIX = "path_prefix"
    SINGLE_URL = "single_url"


@dataclass
class OptOutExclusionRecord:
    """Exclusion log entry per CoP Chapter 2 Commitment 2.2."""
    domain_or_url: str
    signal_type: OptOutSignalType
    signal_content: str  # Raw signal text for audit trail
    exclusion_scope: OptOutScope
    detected_at: datetime
    excluded_at: datetime
    verified_at: Optional[datetime] = None
    retroactive: bool = False  # True if signal added after initial crawl
    notes: str = ""


@dataclass
class LicenceRecord:
    """Licensing register entry per CoP Chapter 2 Commitment 2.4."""
    dataset_name: str
    rights_holder: str
    licence_type: str  # e.g. "TDM-specific non-exclusive"
    licence_scope: list[str]  # e.g. ["training", "fine-tuning"]
    licence_start: date
    licence_expiry: Optional[date]
    sublicensing_permitted: bool
    licence_document_ref: str  # internal document ID for audit


class TDMOptOutTracker:
    """
    Implements GPAI CoP Chapter 2 TDM opt-out tracking per Art.52(1)(c).

    Usage:
        tracker = TDMOptOutTracker(model_id="model-v3", provider="MyCompany")
        # During crawl pipeline
        if tracker.is_excluded(url):
            skip_url(url)
        else:
            crawl_and_check_signals(url, tracker)
    """

    def __init__(self, model_id: str, provider: str):
        self.model_id = model_id
        self.provider = provider
        self.exclusion_log: list[OptOutExclusionRecord] = []
        self.licence_register: list[LicenceRecord] = []
        self._excluded_domains: set[str] = set()
        self._excluded_urls: set[str] = set()

    def record_exclusion(self, record: OptOutExclusionRecord) -> None:
        """Record an opt-out exclusion. Updates in-memory lookup for real-time checking."""
        self.exclusion_log.append(record)
        if record.exclusion_scope == OptOutScope.FULL_DOMAIN:
            domain = self._extract_domain(record.domain_or_url)
            self._excluded_domains.add(domain)
        elif record.exclusion_scope == OptOutScope.SINGLE_URL:
            self._excluded_urls.add(record.domain_or_url)

    def is_excluded(self, url: str) -> bool:
        """Real-time check: is this URL covered by an opt-out exclusion?"""
        if url in self._excluded_urls:
            return True
        domain = self._extract_domain(url)
        return domain in self._excluded_domains

    def register_licence(self, record: LicenceRecord) -> None:
        """Register a licensing agreement for non-Art.4 covered content."""
        self.licence_register.append(record)

    def handle_retroactive_optout(
        self,
        domain: str,
        signal_type: OptOutSignalType,
        signal_content: str,
        notified_at: datetime,
    ) -> dict:
        """
        CoP Commitment 2.3: process retroactive opt-out.
        Returns triage assessment for human review.
        """
        record = OptOutExclusionRecord(
            domain_or_url=domain,
            signal_type=signal_type,
            signal_content=signal_content,
            exclusion_scope=OptOutScope.FULL_DOMAIN,
            detected_at=notified_at,
            excluded_at=datetime.now(),
            retroactive=True,
        )
        self.record_exclusion(record)
        # Triage: estimate corpus impact (placeholder — integrate with corpus index)
        return {
            "domain": domain,
            "signal_type": signal_type.value,
            "retroactive": True,
            "action_taken": "domain_excluded_from_future_collection",
            "corpus_impact_assessment": "requires_corpus_index_query",
            "recommendation": "query corpus index for domain presence; if >1% of training tokens, escalate to legal",
            "recorded_at": datetime.now().isoformat(),
        }

    def generate_art52_2_summary(self) -> dict:
        """Generate Art.52(2) training data transparency summary — copyright section."""
        exclusions_by_type: dict[str, int] = {}
        for record in self.exclusion_log:
            key = record.signal_type.value
            exclusions_by_type[key] = exclusions_by_type.get(key, 0) + 1
        retroactive_count = sum(1 for r in self.exclusion_log if r.retroactive)
        return {
            "model_id": self.model_id,
            "provider": self.provider,
            "generated_at": datetime.now().isoformat(),
            "copyright_compliance": {
                "framework": "DSM Directive (EU) 2019/790 Art.4(3)",
                "eu_ai_act_obligation": "Art.52(1)(c)",
                "cop_chapter": "Chapter 2",
                "opt_out_signals_monitored": [
                    "robots.txt (bot-specific and wildcard Disallow)",
                    "ai.txt",
                    "HTML meta noai/noimageai",
                    "HTTP X-Robots-Tag noai",
                    "Rights holder retroactive notifications",
                ],
                "total_exclusions": len(self.exclusion_log),
                "exclusions_by_signal_type": exclusions_by_type,
                "retroactive_optout_count": retroactive_count,
                "licensed_datasets": len(self.licence_register),
                "licence_register_available_on_request": True,
            },
        }

    def assess_compliance_readiness(self) -> dict:
        """Check whether CoP Chapter 2 commitments are documentably met."""
        issues = []
        if len(self.exclusion_log) == 0:
            issues.append(
                "CRITICAL: No exclusion log entries — opt-out detection pipeline not confirmed active"
            )
        has_robots_txt = any(
            r.signal_type
            in (OptOutSignalType.ROBOTS_TXT_BOT_SPECIFIC, OptOutSignalType.ROBOTS_TXT_WILDCARD)
            for r in self.exclusion_log
        )
        if not has_robots_txt:
            issues.append(
                "WARNING: No robots.txt opt-out detections logged — confirm crawler parses robots.txt"
            )
        unverified = [r for r in self.exclusion_log if r.verified_at is None]
        if len(unverified) > 0:
            issues.append(
                f"INFO: {len(unverified)} exclusion records lack verification timestamp (CoP Commitment 2.2)"
            )
        return {
            "compliance_ready": len(issues) == 0,
            "issues": issues,
            "exclusion_log_count": len(self.exclusion_log),
            "licence_register_count": len(self.licence_register),
            "art52_2_summary_generatable": True,
        }

    @staticmethod
    def _extract_domain(url: str) -> str:
        parsed = urlparse(url)
        return parsed.netloc or url


# Example: pre-training copyright compliance check
def demo_training_pipeline_integration():
    tracker = TDMOptOutTracker(model_id="foundation-v1", provider="ExampleAI")

    # Simulate: robots.txt opt-out detected during crawl
    tracker.record_exclusion(OptOutExclusionRecord(
        domain_or_url="https://news-publisher.eu",
        signal_type=OptOutSignalType.ROBOTS_TXT_BOT_SPECIFIC,
        signal_content="User-agent: CCBot\nDisallow: /",
        exclusion_scope=OptOutScope.FULL_DOMAIN,
        detected_at=datetime(2025, 3, 15, 10, 0),
        excluded_at=datetime(2025, 3, 15, 10, 0),
        verified_at=datetime(2025, 3, 15, 10, 1),
    ))

    # Simulate: retroactive opt-out from publishers association
    result = tracker.handle_retroactive_optout(
        domain="creative-archive.org",
        signal_type=OptOutSignalType.RETROACTIVE_NOTIFICATION,
        signal_content="Formal TDM opt-out notification from European Publishers Council",
        notified_at=datetime(2025, 9, 20, 14, 30),
    )
    print(json.dumps(result, indent=2))

    # Simulate: licensed dataset
    tracker.register_licence(LicenceRecord(
        dataset_name="Scientific Literature Archive Q1 2025",
        rights_holder="EuropeanResearchPublishers",
        licence_type="TDM non-exclusive perpetual",
        licence_scope=["training", "fine-tuning"],
        licence_start=date(2025, 1, 1),
        licence_expiry=None,
        sublicensing_permitted=False,
        licence_document_ref="LIC-2025-003",
    ))

    # Check compliance and generate Art.52(2) summary
    readiness = tracker.assess_compliance_readiness()
    summary = tracker.generate_art52_2_summary()
    print(json.dumps(readiness, indent=2))
    print(json.dumps(summary, indent=2))


if __name__ == "__main__":
    demo_training_pipeline_integration()
```
The TDMOptOutTracker implements the three audit-critical CoP Chapter 2 functions: real-time exclusion checking during crawl, retroactive opt-out triage, and Art.52(2) summary generation. In a production training pipeline, the exclusion log would be backed by a database, and is_excluded() would query a distributed cache updated by the crawl fleet.
Enforcement: What Non-Compliance Looks Like
Three enforcement paths exist for GPAI Art.52(1)(c) copyright non-compliance:
AI Office enforcement (Art.88): The AI Office can request documentation of a GPAI provider's copyright compliance policy under Art.52(1)(c). A provider that cannot produce a documented, implemented opt-out detection policy — or whose Art.52(2) transparency summary is absent or insufficient — faces proceedings under Art.99(4): up to 3% of global annual turnover.
Member state market surveillance (Art.74): National authorities enforce Art.52 compliance for providers active in their jurisdiction. A rights holder who suspects their opt-out was violated can file a complaint triggering a market surveillance investigation.
Civil copyright litigation: Separate from EU AI Act enforcement, rights holders retain their DSM Directive civil claims. Art.4(3) opt-out violation means the provider cannot rely on the Art.4 TDM exception — the training act is copyright infringement. The EU AI Act's transparency requirements (Art.52(2)) may make it easier for rights holders to identify infringement, because providers must now disclose training corpus categories.
The interaction between AI Office enforcement and civil litigation creates a compliance multiplier: a provider who cannot demonstrate opt-out compliance to the AI Office is also likely to face civil claims from rights holders.
EU Jurisdiction and Infrastructure: Why Hosting Matters
For GPAI providers that train models on EU infrastructure, the copyright analysis is straightforward: DSM Directive Art.4 applies.
For providers that train on non-EU infrastructure (US data centres, for example) but place models on the EU market, the question of whether DSM Directive Art.4 applies to the training act has been contested. The AI Office's interpretive guidance (published Q4 2025) clarifies:
Art.52(1)(c) applies based on market placement, not training location. A GPAI model placed on the EU market must have a copyright compliance policy that complies with "Union copyright law" — the DSM Directive — regardless of where the training occurred.
This means GPAI providers with US-based training infrastructure must either:
- Have retroactively verified that their training data collection respected DSM Art.4(3) opt-outs
- Document the impossibility of retroactive verification and implement prospective compliance for future model versions
- Obtain retroactive licensing for high-significance rights-reserved content in their corpus
For providers using EU sovereign infrastructure (data centres subject to EU law, with EU data processing chains), the copyright compliance documentation is verifiably governed by EU law from the start. CLOUD Act exposure does not affect the training data record — US government access requests cannot reach training data documentation stored in EU-jurisdiction infrastructure.
25-Item CoP Chapter 2 Copyright Compliance Checklist
Pre-Training Data Collection
- 1. robots.txt parsing implemented in crawl pipeline for AI-specific `User-agent` directives (CCBot, GPTBot, anthropic-ai, etc.)
- 2. Wildcard `User-agent: *` robots.txt `Disallow` treated as TDM opt-out for training purposes
- 3. ai.txt parsing implemented at domain level before any content collection
- 4. HTML meta tag `noai`/`noimageai` scanning in response parser pipeline
- 5. HTTP `X-Robots-Tag: noai` header inspection at request time
- 6. Exclusion log database operational and writing records for each detected opt-out signal
- 7. Crawl pipeline enforces exclusion before, not after, content collection
Exclusion Log (CoP Commitment 2.2)
- 8. Each exclusion record captures: domain/URL, signal type, signal content, scope, detection timestamp
- 9. Exclusion records have a verification timestamp (not just detection timestamp)
- 10. Exclusion log retention policy set to model deployment duration + 10 years (Art.18 alignment)
- 11. Exclusion log exportable in machine-readable format for AI Office audit requests
Retroactive Opt-Out Handling (CoP Commitment 2.3)
- 12. Process defined for monitoring retroactive opt-out notifications from rights holder organisations
- 13. Retroactive opt-outs trigger corpus impact assessment (estimate percentage of affected tokens)
- 14. High-impact retroactive opt-outs (>1% of training tokens) escalated to legal team
- 15. Retroactive opt-out response outcomes documented with timestamps
Licensing Register (CoP Commitment 2.4)
- 16. All non-Art.4-covered training data sources have documented licence records
- 17. Licence records specify scope (training, fine-tuning, derivative distribution)
- 18. Licence expiry dates monitored; renewals tracked before expiry
- 19. Sublicensing rights documented for each dataset (relevant for model API providers)
- 20. Licensing register retained for audit purposes with document references
Art.52(2) Transparency Summary (CoP Commitment 2.5)
- 21. Transparency summary published at a public, unauthenticated, machine-readable URL
- 22. Copyright compliance section describes which opt-out signals are monitored (not just "we respect opt-outs")
- 23. Transparency summary includes quantitative disclosure: exclusion count by signal type
- 24. Transparency summary updated within 30 days of each new model release
- 25. Annual update completed even in years with no new model release
What "State of the Art Technologies" Means in Practice
Art.52(1)(c) specifically requires opt-out detection using "state of the art technologies." The AI Office's interpretive guidance identifies four implementation requirements for state of the art compliance:
Real-time detection: Opt-out signals must be checked at crawl time, not after content is collected. Post-hoc filtering of already-collected content does not satisfy Art.52(1)(c) because the opt-out was not respected at the point of collection.
Signal coverage breadth: Detection must cover all established machine-readable formats (robots.txt, ai.txt, meta tags, HTTP headers). A system that only checks robots.txt while ignoring ai.txt and meta tags is not state of the art as of 2026.
Update cadence: As new opt-out signal formats emerge, the detection system must be updated. The AI Office expects GPAI providers to monitor the technical evolution of TDM opt-out standards and update their detection systems within a reasonable period of new format adoption.
Retroactive signal monitoring: State of the art includes monitoring for retroactive signals from rights holder organisations (publishers' associations, creator groups) who issue bulk opt-out notifications on behalf of members.
A paper policy that says "we check robots.txt" without verifiable technical implementation does not meet the "state of the art" standard.
Connection to Art.18 Documentation Retention
Art.52(1)(c) copyright compliance documentation, the exclusion log, and the licensing register must be retained consistent with Art.18's general GPAI documentation retention requirement. For GPAI models, Art.18 requires documentation retention for 10 years after market placement — not 10 years after the training run.
This means:
- A model trained in 2024 and still deployed in 2030 must have copyright compliance records available until 2040
- Each version of the Art.52(2) transparency summary should be archived with a publication timestamp
- Exclusion log records cannot be deleted on a rolling basis — they must be preserved for the full retention period
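The 2024-to-2040 example implies the 10-year clock runs from the end of market availability rather than the initial placement date; a sketch of that reading (leap-day edge cases ignored):

```python
# Sketch: one reading of the retention rule, where the 10-year clock runs
# from the end of market availability (matching the 2024 -> 2040 example
# above). Leap-day edge cases are ignored in this sketch.
from datetime import date

def retention_until(market_placement: date, last_day_on_market: date) -> date:
    """Records must survive until 10 years after the model leaves the market."""
    anchor = max(market_placement, last_day_on_market)
    return anchor.replace(year=anchor.year + 10)

print(retention_until(date(2024, 6, 1), date(2030, 6, 1)))  # 2040-06-01
```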
CLOUD Act risk applies here: copyright compliance documentation stored on US-jurisdiction infrastructure is subject to US government access requests under the US CLOUD Act. For EU-jurisdiction GPAI providers (or providers asserting EU law compliance), training data documentation stored in EU sovereign infrastructure eliminates this exposure.
Summary
GPAI CoP Chapter 2 transforms EU AI Act Art.52(1)(c)'s copyright compliance obligation from an aspiration into an audit-ready engineering requirement:
- Real-time opt-out detection during training data collection — robots.txt, ai.txt, meta tags, HTTP headers
- Exclusion log maintained for each detected signal, with verification timestamps
- Retroactive opt-out handling process with corpus impact triage
- Licensing register for non-Art.4-covered content
- Art.52(2) transparency summary published publicly and updated regularly
For SaaS developers integrating GPAI APIs: verify your provider's transparency summary before procurement, especially in regulated sectors. For fine-tuners releasing externally: your web-scraped fine-tuning data is subject to the same Art.4(3) obligations.
The enforcement multiplier — AI Office proceedings plus civil copyright claims — makes Art.52(1)(c) non-compliance a material legal risk, not just a box-ticking exercise.
EU sovereign infrastructure note: Training data and compliance documentation hosted on EU-jurisdiction infrastructure (without Cloud Act exposure) creates a clean audit trail that can be disclosed to EU regulators without cross-border data transfer complications. See how sota.io supports EU-compliant AI infrastructure →