EU AI Act Art.52 GPAI Model General Obligations: Technical Documentation, Training Data & Copyright — Developer Guide (2026)
EU AI Act Chapter V imposes a tiered obligation structure on General-Purpose AI (GPAI) model providers. Article 51 classifies GPAI models into two tiers based on training compute — general GPAI models and GPAI models with systemic risk. Article 52 establishes the baseline obligations that apply to every GPAI model provider regardless of tier. These are not optional enhancements: they are the minimum compliance floor for any organisation that trains, fine-tunes, or distributes a GPAI model in or into the EU market.
Chapter V of the EU AI Act became applicable on 2 August 2025, a year ahead of the Act's general application date of 2 August 2026. Art.52 obligations are therefore already in force for GPAI model providers. If you train a foundation model, a large language model, a multimodal model, or any other model meeting the GPAI definition, Art.52 compliance is not a future concern; it is a present legal obligation.
For downstream SaaS developers integrating GPAI APIs into products, Art.52 is the regulatory basis for the technical documentation and model cards that upstream GPAI providers must supply. Art.55 carries Art.52's documentation obligations downstream: your GPAI API provider must give you the model card and training data summary that Art.52 requires them to maintain. Understanding Art.52 clarifies what you are contractually entitled to demand from your GPAI provider and what obligations attach to you once that documentation is supplied.
Art.52 in the EU AI Act Structure
Art.52 is the second article of Chapter V — General-Purpose AI Models (Art.51–56). It provides the compliance baseline: the obligations that apply regardless of systemic risk classification.
| Article | Title | Who It Applies To |
|---|---|---|
| Art.51 | GPAI model classification | Defines who is a GPAI provider and which tier |
| Art.52 | General GPAI model obligations | All GPAI model providers — both tiers |
| Art.53 | Systemic risk GPAI obligations | Systemic risk tier only (adversarial testing, incident reporting, cybersecurity) |
| Art.54 | Authorised representative | Non-EU systemic risk providers only |
| Art.55 | Downstream provider obligations | All GPAI providers — obligations to downstream integrators |
| Art.56 | Code of practice | Systemic risk tier — compliance pathway |
Art.52 provides six baseline obligations:
- Art.52(1)(a) — Draw up, keep up-to-date, and make available to the Commission technical documentation covering model architecture, training methodology, purposes, and capabilities and limitations
- Art.52(1)(a)(i) — Include in technical documentation training data transparency covering the types and sources of training data and, where applicable, geographic coverage and quality assessment
- Art.52(1)(a)(ii) — Include in technical documentation a copyright compliance policy and a summary of the content used for training
- Art.52(1)(b) — Draw up and make available to downstream providers an information document (model card) in machine-readable format
- Art.52(2) — Upon request, provide technical documentation to the Commission and national authorities
- Art.52(3) — Register and publish a summary of training data for public transparency purposes (in the EU database under Art.71)
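The six obligations above lend themselves to a simple compliance checklist. The sketch below is illustrative: the enum member names and the `outstanding_obligations` helper are our own labels for the provisions listed above, not terms from the Act.

```python
from enum import Enum

class Art52Obligation(Enum):
    """Illustrative labels for the six Art.52 baseline obligations."""
    TECHNICAL_DOCUMENTATION = "art52_1a_technical_documentation"
    TRAINING_DATA_TRANSPARENCY = "art52_1a_i_training_data_transparency"
    COPYRIGHT_POLICY = "art52_1a_ii_copyright_policy"
    MODEL_CARD = "art52_1b_model_card"
    COMMISSION_ACCESS = "art52_2_commission_access"
    PUBLIC_SUMMARY = "art52_3_public_summary"

def outstanding_obligations(completed: set[Art52Obligation]) -> list[Art52Obligation]:
    """Return the baseline obligations not yet evidenced as complete."""
    return [o for o in Art52Obligation if o not in completed]
```

A provider that has only published its model card would see the remaining five obligations returned, which makes the checklist useful as a gating step before market placement.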
Art.52(1)(a): Technical Documentation for GPAI Models
Art.52(1)(a) requires every GPAI provider to draw up and maintain technical documentation before placing the model on the EU market. This documentation is distinct from and additional to the Annex IV technical documentation required for high-risk AI systems — it is GPAI-specific and governed by Annex XI.
Annex XI mandatory content elements for GPAI technical documentation:
| Element | Mandatory Content | Practical Implication |
|---|---|---|
| Model architecture | Type of architecture, number of parameters, context window size, modality (text/image/audio/code/multimodal) | Must be documented in the technical file before market placement |
| Training methodology | Pre-training approach (self-supervised / masked language modelling / RLHF / Constitutional AI), fine-tuning steps, alignment procedures | The full training pipeline must be described — not just the final training run |
| Intended purposes | Tasks the model is designed to perform, target use cases, deployment scenarios described in the provider's documentation | The documentation must reflect what the model is marketed and deployed for |
| Capabilities | Performance on standard benchmarks relevant to the model's domain, demonstrated capabilities across task categories | Must be current — if capabilities are updated, the documentation must be updated |
| Limitations | Known failure modes, hallucination rates, bias and fairness assessments, safety benchmarks, capability limitations | Limitations documentation is mandatory, not optional transparency |
| Performance evaluation | Results of internal testing and third-party evaluations, including adversarial probes for general GPAI models | For systemic risk models, this is enhanced by Art.53(1)(b) adversarial testing requirements |
Documentation update obligation:
Art.52(1) requires that documentation is "kept up-to-date." This creates a continuous obligation — not just a one-time pre-market filing. Significant updates trigger documentation revision requirements:
| Update Type | Documentation Obligation |
|---|---|
| New pre-training run (continued training) | Update training methodology + cumulative FLOPs + new capability assessments |
| Fine-tuning update released publicly | Update intended purposes, capabilities, and limitations sections |
| Safety patch or alignment update | Update training methodology and limitations sections |
| New modality added (e.g., adding image understanding to text model) | Full documentation revision as new capabilities are added |
| Benchmark performance change | Update capabilities and performance evaluation sections |
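The trigger table above can be encoded as a lookup so documentation tooling flags the sections needing revision after each model update. Event keys and section names below are illustrative shorthand for the table rows, not terminology from the Act; the fallback for unknown events is a design choice, not a legal requirement.

```python
# Illustrative mapping of model-update events to the technical-documentation
# sections that need revision under the "kept up-to-date" obligation.
UPDATE_TRIGGERS: dict[str, list[str]] = {
    "continued_pretraining": ["training_methodology", "cumulative_flops", "capabilities"],
    "public_finetune_release": ["intended_purposes", "capabilities", "limitations"],
    "safety_or_alignment_patch": ["training_methodology", "limitations"],
    "new_modality": ["architecture", "training_methodology", "intended_purposes",
                     "capabilities", "limitations", "performance_evaluation"],
    "benchmark_change": ["capabilities", "performance_evaluation"],
}

def sections_to_revise(update_type: str) -> list[str]:
    """Look up which documentation sections an update invalidates."""
    try:
        return UPDATE_TRIGGERS[update_type]
    except KeyError:
        # Unknown events default to a full review rather than silently passing.
        return ["full_documentation_review"]
```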
Art.52(1)(a)(i): Training Data Transparency
Art.52(1)(a)(i) requires that the technical documentation includes transparency about training data. This is one of the most commercially sensitive provisions of Art.52 because it requires disclosing information that model providers have historically kept confidential for competitive reasons.
Mandatory training data disclosure elements:
| Element | What Must Be Disclosed | Notes |
|---|---|---|
| Types of data | Categories of data used (web text, books, code, scientific papers, conversation data, images, audio, synthetic data, etc.) | Type-level disclosure — not specific dataset names in all cases |
| Sources of data | Origin of training datasets (Common Crawl, licensed datasets, proprietary data collection, partnerships) | Source-level disclosure — includes whether data was scraped, licensed, or created |
| Geographic coverage | Languages represented, countries of origin of content creators, geographic distribution of training corpus | Relevant for assessing cultural bias and multilingual capability representation |
| Quality assessment | Filtering procedures applied, deduplication methods, quality scoring mechanisms, NSFW/toxicity filtering | Demonstrates diligence in training data curation — relevant to EU copyright and fundamental rights compliance |
Sensitive disclosures — proportionality principle:
Art.52(1)(a)(i) requires disclosure of training data information to the Commission (Art.52(2)) and in summary form publicly (Art.52(3)), but trade secret protection applies to the confidential portions. The regulation recognises a tension:
| Information Category | Disclosure to Commission | Public Summary (Art.52(3)) |
|---|---|---|
| Training data types | Full disclosure | Required summary |
| Training data sources (general) | Full disclosure | Required summary |
| Specific proprietary dataset names and compositions | Subject to trade secret protection | Aggregate categories only |
| Dataset licensing terms and costs | Trade secret protection applies | Not required in public summary |
| Synthetic data generation methodology | Full disclosure | Summary |
| Geographic distribution statistics | Full disclosure | Required summary |
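The Commission-versus-public split in the table above can be modelled by tagging each disclosure item with a trade-secret flag. This is a minimal sketch; the item shape (`name`, `value`, `trade_secret`) is our own convention, and real tooling would also substitute aggregate categories for withheld items rather than simply omitting them.

```python
def split_disclosure(items: list[dict]) -> tuple[list[dict], list[dict]]:
    """Partition disclosure items into (commission_view, public_view).

    The Commission receives everything; the public Art.52(3) summary
    omits items flagged as trade secrets.
    """
    commission = list(items)
    public = [i for i in items if not i.get("trade_secret", False)]
    return commission, public
```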
Bias and fairness dimension of training data transparency:
The geographic coverage and quality assessment requirements implicitly address bias. A GPAI provider must document:
- Which languages are over- or under-represented in training data
- Which geographic regions' web content was included or excluded
- Whether demographic or cultural bias assessments were conducted on the training corpus
This creates an audit trail that regulators — and downstream providers claiming Art.55 rights — can use to evaluate whether the model's capabilities and limitations documentation accurately reflects its training data composition.
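A first-pass language representation audit of the kind described above can be automated over the training corpus statistics. The sketch below assumes token counts per BCP 47 language code; the 1% threshold is an arbitrary illustration, not a legal test of under-representation.

```python
def flag_language_imbalance(token_counts: dict[str, int],
                            min_share: float = 0.01) -> list[str]:
    """Flag languages whose share of training tokens falls below min_share.

    token_counts maps BCP 47 language codes to approximate token counts.
    Returns flagged codes in sorted order for a stable audit trail.
    """
    total = sum(token_counts.values())
    if total == 0:
        return []
    return [lang for lang, n in sorted(token_counts.items())
            if n / total < min_share]
```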
Art.52(1)(a)(ii): Copyright Compliance Policy and Training Data Summary
Art.52(1)(a)(ii) is one of the most legally consequential provisions of Art.52. It requires every GPAI provider to:
- Have and document a copyright compliance policy
- Prepare a summary of the content used for training that enables rights holders to exercise their rights
Why Art.52(1)(a)(ii) exists:
The provision intersects with the EU Digital Single Market Directive (DSMD) 2019/790, which created the Text and Data Mining (TDM) exception at Articles 3 and 4. The DSMD framework:
| DSMD Provision | What It Does | Art.52(1)(a)(ii) Connection |
|---|---|---|
| Art.3 DSMD | Mandatory TDM exception for research organisations — cannot be overridden by contract | Research TDM use is presumptively lawful; Art.52 documentation confirms compliance |
| Art.4 DSMD | General TDM exception — but rights holders can opt out using machine-readable reservation | Art.52(1)(a)(ii) requires providers to document how they handle opt-outs |
| Art.4(3) DSMD | Rights holders may reserve TDM rights using machine-readable means | The copyright compliance policy must address how reserved rights are detected and respected |
Copyright compliance policy — mandatory content:
| Policy Element | Required Content |
|---|---|
| TDM opt-out detection | How the provider checks for robots.txt TDM opt-outs, tdmrep.json entries, and other machine-readable reservations |
| Licensed content | Categories of content acquired under licence agreements and the licences held |
| Public domain and open licence content | How CC0, CC-BY, CC-BY-SA, and similar open-licensed content is handled |
| Lawful web scraping | Legal basis for scraping non-opted-out content under Art.4 DSMD TDM exception |
| Dispute resolution | Process for responding to copyright infringement claims from rights holders |
| Ongoing compliance monitoring | How the provider monitors for new TDM opt-outs and updates training pipelines |
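TDM opt-out detection can be sketched against the TDM Reservation Protocol (TDMRep), which publishes a JSON document at a well-known location (commonly `/.well-known/tdmrep.json`). The parser below assumes a simplified document shape — a JSON object or list of objects carrying a `tdm-reservation` field — and is not a full TDMRep implementation; fetching, `robots.txt` handling, and location matching are out of scope.

```python
import json

def tdm_reserved(tdmrep_text: str) -> bool:
    """Conservatively interpret a fetched tdmrep.json document.

    Any entry with tdm-reservation == 1 counts as a reservation, and an
    unparseable document is treated as reserved, erring on the side of
    the rights holder.
    """
    try:
        doc = json.loads(tdmrep_text)
    except json.JSONDecodeError:
        return True  # unparseable: assume reserved
    entries = doc if isinstance(doc, list) else [doc]
    return any(isinstance(e, dict) and e.get("tdm-reservation") == 1
               for e in entries)
```

Erring toward "reserved" on malformed input is a deliberate design choice: a false positive excludes some lawful content, while a false negative would mean training on reserved works.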
Training data summary for rights holders:
The "summary of content used for training" serves a specific purpose: it enables rights holders to identify whether their works were included in training data and to exercise their rights (including claims under DSMD and national copyright law). The summary must be:
- Detailed enough to allow plausible identification of works or categories of works
- Accessible to rights holders — this implies it must be findable and understandable without specialist knowledge
- Current — must be updated when new training data is added
EU-wide harmonisation: The EU AI Office is developing guidance on what constitutes a compliant copyright compliance policy and a sufficient training data summary. Providers should monitor AI Office publications at ai.ec.europa.eu for updated requirements.
Art.52(1)(b): Machine-Readable Model Card for Downstream Providers
Art.52(1)(b) requires every GPAI provider to draw up an information document — commonly called a model card — and make it available to downstream providers who integrate the GPAI model into their own AI systems. This is the primary mechanism by which Art.52 obligations flow downstream via Art.55.
Machine-readable format requirement:
The model card must be machine-readable, not just human-readable. This enables:
- Automated compliance checking by downstream developers integrating the model via API
- Integration into AI system registries and EU AI Act compliance tooling
- Version tracking and automated notification when upstream model documentation changes
Minimum model card content under Art.52(1)(b):
| Section | Content Required | Format Guidance |
|---|---|---|
| Model identity | Provider name, model name, version, release date, model type (GPAI / GPAI with systemic risk) | Stable identifiers; version-controlled |
| Art.51 classification | Tier classification (general GPAI or systemic risk), designated or threshold-based | Binary field + supporting evidence reference |
| Intended purposes | Approved use cases, supported languages, input/output modalities | List format; version-controlled |
| Capabilities | Performance summary across relevant benchmarks, demonstrated task competence | Benchmark name + score + version of benchmark |
| Limitations | Known failure modes, hallucination characteristics, bias dimensions, capability gaps | Structured list; severity-tagged |
| Training data summary | Summary of training data types, sources, geographic coverage — from Art.52(1)(a)(i) | Reference to full Art.52(3) public summary |
| Copyright compliance | Pointer to copyright compliance policy from Art.52(1)(a)(ii) | URL to policy document; version-tracked |
| Art.55 downstream obligations | What obligations downstream integrators have when integrating this model | Checklist format for downstream compliance |
| Update notification | How downstream providers will be notified of material model changes | Webhook URL or notification API endpoint |
| Contact point | GPAI provider contact for compliance inquiries from downstream providers | Email or API endpoint |
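Since the Act fixes no schema, a minimal machine-readable model card can be expressed as JSON with a required-field check mirroring the table above. The key names, URLs, and values below are illustrative assumptions, not a mandated format.

```python
import json

# Illustrative required top-level fields, mirroring the table above.
REQUIRED_CARD_FIELDS = [
    "model_identity", "art51_classification", "intended_purposes",
    "capabilities", "limitations", "training_data_summary_ref",
    "copyright_policy_ref", "downstream_obligations",
    "update_notification", "contact_point",
]

def validate_model_card(card_json: str) -> list[str]:
    """Return the required top-level fields missing from a model card."""
    card = json.loads(card_json)
    return [f for f in REQUIRED_CARD_FIELDS if f not in card]

# Hypothetical example card for a fictional provider.
card = json.dumps({
    "model_identity": {"provider": "ExampleAI", "model": "example-1", "version": "2.1.0"},
    "art51_classification": "general_gpai",
    "intended_purposes": ["text_generation"],
    "capabilities": {"mmlu": 0.71},
    "limitations": ["degraded_accuracy_in_low_resource_languages"],
    "training_data_summary_ref": "https://example.com/art52-3-summary",
    "copyright_policy_ref": "https://example.com/copyright-policy",
    "downstream_obligations": ["art55_documentation_passthrough"],
    "update_notification": "https://example.com/webhooks/model-updates",
    "contact_point": "compliance@example.com",
})
```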
Machine-readable format options:
The AI Act does not mandate a specific schema, but emerging standards include:
- Croissant (MLCommons format for ML datasets — applicable to training data documentation)
- JSON Schema-based model cards (Google model cards, Hugging Face model card format)
- SPDX (for training data licensing information)
- Schema.org extensions for AI systems
Providers should choose a format that can be consumed by compliance automation tools and is compatible with the EU AI Office's API-based registration system under Art.71.
Art.52(2): Commission Access Obligation
Art.52(2) grants the European Commission the right to request access to GPAI technical documentation. This is a reactive right (the Commission requests access) rather than a proactive submission obligation, but it has significant compliance implications.
When Art.52(2) is triggered:
| Scenario | Commission Action | Provider Response Obligation |
|---|---|---|
| Routine compliance monitoring | Commission requests documentation review | Provider must provide within the specified timeframe |
| Systemic risk assessment | Commission is evaluating potential Art.51(2) designation of a model | Provider must provide detailed technical documentation to support or rebut assessment |
| Serious incident investigation | Commission or AI Office is investigating an incident involving the GPAI model | Priority access; provider must cooperate under Art.52(2) + Art.21 |
| GPAI code of practice review | Commission is assessing adequacy of code of practice measures | Technical documentation is central evidence |
Practical implications for compliance infrastructure:
Art.52(2) creates an ongoing obligation to maintain documentation in a form that can be produced on request. This means:
- Technical documentation must be version-controlled and auditable
- Previous versions must be retained — the Commission may request historical documentation to assess when a model crossed a capability or compute threshold
- Documentation must be in a form that can be shared with the Commission (translated into an EU language if required)
- Confidentiality protections (trade secrets) apply but cannot be used to refuse access — they affect the handling of disclosed information, not the disclosure obligation itself
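The retention requirements above can be met with an append-only, version-aware store, so that historical documentation remains producible on request. A minimal in-memory sketch; class and method names are our own, and persistence, translation, and access control are out of scope.

```python
from datetime import datetime, timezone
from typing import Optional

class VersionedDocumentationStore:
    """Append-only store for technical documentation versions.

    Previous versions are never overwritten, supporting Art.52(2)-style
    requests for historical documentation.
    """
    def __init__(self) -> None:
        self._versions: list[dict] = []

    def publish(self, version: str, content: dict) -> None:
        """Record a new documentation version with a UTC timestamp."""
        self._versions.append({
            "version": version,
            "content": content,
            "published_at": datetime.now(timezone.utc).isoformat(),
        })

    def current(self) -> Optional[dict]:
        return self._versions[-1] if self._versions else None

    def history(self) -> list[str]:
        return [v["version"] for v in self._versions]
```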
Art.52(2) vs. Art.21 (MSA cooperation for high-risk AI):
Art.52(2) applies to GPAI models specifically; Art.21 applies to high-risk AI systems. A GPAI model integrated into a high-risk AI system may be subject to both obligations simultaneously — the high-risk AI system integrator cooperates under Art.21, and the upstream GPAI provider cooperates under Art.52(2).
Art.52(3): Public Training Data Summary
Art.52(3) requires GPAI providers to register a summary of the content used for training in the EU AI database under Art.71 and make it publicly accessible. This is the public-facing obligation — distinct from the Commission-only documentation under Art.52(2).
What the public summary must contain:
| Element | Required in Public Summary | Confidentiality Protection |
|---|---|---|
| Training data types | Yes — all major categories | No protection — public interest disclosure |
| Training data sources (general) | Yes — at category level | Trade secrets on specific proprietary sources |
| Geographic coverage | Yes — by language and region | No protection |
| Quality filtering procedures | Summary description | Specific algorithms may have trade secret protection |
| Copyright compliance measures | Summary — "complies with EU copyright law via the following measures" | Policy document referenced; details protected |
| Data collection timeframe | Yes — training data vintage/cutoff | No protection |
| Opt-out respect measures | Yes — confirmation of TDM opt-out compliance | No protection |
Registration in the EU AI database:
Art.52(3) links to Art.71 — the EU AI Office maintains a central database of AI systems and GPAI models. GPAI providers must register their model and publish the training data summary through this database. The AI Office provides the registration interface and has published technical guidelines on submission format.
Update obligation:
If a new version of the GPAI model is trained on additional data, the public summary must be updated to reflect the extended training data. The EU AI database is version-aware — providers must submit updated summaries tied to each major model version.
Art.52 × Art.55: Downstream Information Flow Chain
Art.52 creates documentation obligations on GPAI providers; Art.55 transmits those obligations downstream to providers who integrate GPAI models into their AI systems. This creates a two-tier information chain:
| Layer | Actor | Obligation |
|---|---|---|
| Tier 1 (GPAI Provider) | Foundation model provider | Draws up Art.52 technical documentation, model card, copyright policy, training data summary |
| Tier 2 (Downstream Provider) | SaaS developer integrating GPAI API | Receives Art.52 model card under Art.55; uses it to populate their own Art.11/Annex IV technical documentation for any high-risk AI system built on the GPAI model |
| End User | Deployer or user of downstream AI product | Benefits from transparency chain; can exercise rights under Art.86 (right to explanation) based on information documented through Art.52 → Art.55 chain |
What downstream providers are entitled to receive under Art.55:
When you integrate a GPAI API into a product — particularly a high-risk AI system under Annex III — you are entitled to receive from your GPAI API provider:
- A copy of the machine-readable model card (Art.52(1)(b))
- A reference to the publicly available training data summary (Art.52(3))
- A reference to the copyright compliance policy (Art.52(1)(a)(ii))
- The Art.51 classification status of the underlying model (general GPAI or systemic risk)
- For systemic risk models: notification of classification change and related obligations
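A downstream integrator can check the entitlements listed above against what its provider has actually delivered. The artifact keys below are illustrative, chosen to match the hypothetical model-card shape used earlier in this guide.

```python
def missing_art55_artifacts(received: dict) -> list[str]:
    """List the Art.52-derived artifacts a downstream integrator has
    not yet received from its GPAI provider. Keys are illustrative."""
    expected = {
        "model_card": "machine-readable model card (Art.52(1)(b))",
        "training_data_summary_ref": "public training data summary (Art.52(3))",
        "copyright_policy_ref": "copyright compliance policy (Art.52(1)(a)(ii))",
        "art51_classification": "tier classification of the underlying model",
    }
    return [desc for key, desc in expected.items() if not received.get(key)]
```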
Contractual enforcement of Art.55 rights:
Art.55 does not create an automatic legal transfer — it must be implemented through contracts between GPAI providers and downstream developers. Downstream developers should:
- Include explicit Art.52/Art.55 model card provision obligations in API terms of service contracts
- Specify the required format (machine-readable) and update frequency
- Include provisions for notification of material model changes that affect the model card
CLOUD Act × Art.52 Technical Documentation
Art.52 creates documentation that is highly sensitive from a legal and competitive perspective — and that documentation is directly at risk from CLOUD Act compellability when stored on US cloud infrastructure.
At-risk documentation under Art.52:
| Documentation | Art.52 Requirement | CLOUD Act Risk |
|---|---|---|
| Cumulative training compute logs | Required for Commission notification and technical documentation | Compellable by US law enforcement — potential disclosure of proprietary training methodology |
| Training dataset composition | Art.52(1)(a)(i) — types and sources of data | Commercially sensitive; training data selection is a core competitive differentiator |
| Copyright compliance records | Art.52(1)(a)(ii) — policy and data usage summary | Litigation-relevant; copyright disputes use these records as evidence |
| Capability and limitations documentation | Art.52(1)(a) — benchmarks and known failures | Competitive intelligence; discloses safety vulnerabilities |
| Model card versions | Art.52(1)(b) — machine-readable model card history | Version history shows capability evolution; competitive intelligence |
| Training data opt-out compliance logs | Part of copyright compliance policy | Copyright litigation evidence; dual-access risk |
Dual-compellability scenario for GPAI providers:
A GPAI provider operating primarily on US cloud infrastructure (AWS, Azure, GCP) faces the following simultaneous obligations:
- EU Art.52(2): Commission can request access to technical documentation — to be provided to EU authorities
- CLOUD Act: US law enforcement can compel access to the same records stored on US infrastructure — without an EU court order
These two obligations can conflict: the GPAI provider is simultaneously obligated to protect commercially sensitive documentation for EU regulatory purposes and subject to compelled disclosure to US law enforcement.
EU-native infrastructure as structural mitigation:
Storing GPAI compliance documentation — technical files, training data transparency reports, copyright compliance records, model card versions — on EU-native infrastructure (incorporated in the EU, operating under GDPR and EU administrative law) removes direct CLOUD Act compellability. The records are subject to EU judicial process only, and access by non-EU authorities requires Mutual Legal Assistance Treaty (MLAT) procedures that provide legal visibility.
For developers building compliance infrastructure for GPAI providers — model registries, technical documentation management systems, copyright compliance tracking systems — EU-native PaaS deployment is an architectural requirement to achieve single-jurisdiction compliance.
Python Implementation
GPAITechnicalDocumentationRecord
```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional
from datetime import date


class GPAITier(Enum):
    GENERAL = "general_gpai"
    SYSTEMIC_RISK = "systemic_risk_gpai"


class DocumentationStatus(Enum):
    DRAFT = "draft"
    COMPLETE = "complete"
    UNDER_REVIEW = "under_review"
    APPROVED = "approved"


@dataclass
class GPAITechnicalDocumentationRecord:
    """Art.52(1)(a) — GPAI model technical documentation record."""
    model_name: str
    model_version: str
    provider_name: str
    tier: GPAITier
    architecture_type: str  # e.g., "transformer", "diffusion", "mixture-of-experts"
    parameter_count: Optional[int]  # May be confidential — None if not disclosed
    context_window: Optional[int]  # Token context window
    modalities: list[str]  # e.g., ["text", "image", "code"]
    training_methodology: str  # Pre-training approach description
    fine_tuning_steps: list[str]  # List of fine-tuning and alignment procedures
    intended_purposes: list[str]  # Documented intended use cases
    supported_languages: list[str]  # BCP 47 language codes
    capabilities_summary: str  # Narrative capabilities description
    limitations_summary: str  # Narrative limitations and known failure modes
    benchmark_results: dict[str, float] = field(default_factory=dict)  # name -> score
    safety_evaluations: list[str] = field(default_factory=list)
    cumulative_training_flops: Optional[float] = None  # 10^25 threshold reference
    documentation_version: str = "1.0.0"
    documentation_date: str = ""
    status: DocumentationStatus = DocumentationStatus.DRAFT

    def __post_init__(self):
        if not self.documentation_date:
            self.documentation_date = date.today().isoformat()

    def is_systemic_risk_threshold_met(self) -> bool:
        """Check if cumulative training compute exceeds Art.51 threshold."""
        if self.cumulative_training_flops is None:
            return False
        return self.cumulative_training_flops >= 1e25

    def validate_completeness(self) -> list[str]:
        """Return list of missing mandatory Art.52(1)(a) elements."""
        gaps = []
        if not self.architecture_type:
            gaps.append("Art.52(1)(a): Model architecture type not documented")
        if not self.training_methodology:
            gaps.append("Art.52(1)(a): Training methodology not documented")
        if not self.intended_purposes:
            gaps.append("Art.52(1)(a): Intended purposes not documented")
        if not self.capabilities_summary:
            gaps.append("Art.52(1)(a): Capabilities summary not documented")
        if not self.limitations_summary:
            gaps.append("Art.52(1)(a): Limitations summary not documented")
        if not self.benchmark_results:
            gaps.append("Art.52(1)(a): No benchmark results documented")
        if not self.safety_evaluations:
            gaps.append("Art.52(1)(a): No safety evaluation results documented")
        return gaps

    def to_commission_submission(self) -> dict:
        """Prepare record for Art.52(2) Commission access submission."""
        return {
            "provider": self.provider_name,
            "model": f"{self.model_name} v{self.model_version}",
            "tier": self.tier.value,
            "architecture": self.architecture_type,
            "modalities": self.modalities,
            "methodology": self.training_methodology,
            "purposes": self.intended_purposes,
            "capabilities": self.capabilities_summary,
            "limitations": self.limitations_summary,
            "benchmarks": self.benchmark_results,
            "safety": self.safety_evaluations,
            "compute": self.cumulative_training_flops,
            "version": self.documentation_version,
            "date": self.documentation_date,
        }
```
TrainingDataTransparencyReport
```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TrainingDataSource:
    """A single training data source entry."""
    source_name: str  # e.g., "Common Crawl", "Books3", "GitHub"
    data_type: str  # e.g., "web_text", "books", "code", "scientific_papers"
    collection_method: str  # "scraped", "licensed", "proprietary", "synthetic"
    approximate_size_tokens: Optional[int] = None
    languages: list[str] = field(default_factory=list)
    geographic_coverage: list[str] = field(default_factory=list)  # ISO 3166-1 alpha-2 codes
    collection_period_start: Optional[str] = None  # ISO date
    collection_period_end: Optional[str] = None  # ISO date
    tdm_opt_out_checked: bool = False
    licence_held: Optional[str] = None  # SPDX identifier or "proprietary"
    is_confidential: bool = False  # Trade secret protection flag


@dataclass
class TrainingDataTransparencyReport:
    """Art.52(1)(a)(i) — Training data transparency report."""
    model_name: str
    model_version: str
    report_date: str
    sources: list[TrainingDataSource]
    total_training_tokens: Optional[int] = None
    quality_filtering_methods: list[str] = field(default_factory=list)
    deduplication_methods: list[str] = field(default_factory=list)
    toxicity_filtering: bool = False
    nsfw_filtering: bool = False
    pii_filtering: bool = False

    def get_geographic_distribution(self) -> dict[str, int]:
        """Aggregate geographic coverage across all sources."""
        distribution: dict[str, int] = {}
        for source in self.sources:
            for country in source.geographic_coverage:
                distribution[country] = distribution.get(country, 0) + 1
        return distribution

    def get_language_distribution(self) -> dict[str, int]:
        """Aggregate language coverage across all sources."""
        distribution: dict[str, int] = {}
        for source in self.sources:
            for lang in source.languages:
                distribution[lang] = distribution.get(lang, 0) + 1
        return distribution

    def get_data_type_distribution(self) -> dict[str, list[str]]:
        """Group sources by data type."""
        by_type: dict[str, list[str]] = {}
        for source in self.sources:
            if source.data_type not in by_type:
                by_type[source.data_type] = []
            by_type[source.data_type].append(source.source_name)
        return by_type

    def generate_public_summary(self) -> dict:
        """Generate Art.52(3) public summary — excludes confidential source details."""
        public_sources = []
        for source in self.sources:
            if not source.is_confidential:
                public_sources.append({
                    "type": source.data_type,
                    "collection_method": source.collection_method,
                    "languages": source.languages,
                    "geographic_coverage": source.geographic_coverage,
                    "period": f"{source.collection_period_start} — {source.collection_period_end}",
                    "tdm_opt_out_respected": source.tdm_opt_out_checked,
                })
        return {
            "model": f"{self.model_name} v{self.model_version}",
            "report_date": self.report_date,
            "total_tokens": self.total_training_tokens,
            "data_sources_count": len(self.sources),
            "public_sources": public_sources,
            "geographic_distribution": self.get_geographic_distribution(),
            "language_distribution": self.get_language_distribution(),
            "quality_measures": {
                "filtering": self.quality_filtering_methods,
                "deduplication": self.deduplication_methods,
                "toxicity_filtering": self.toxicity_filtering,
                "nsfw_filtering": self.nsfw_filtering,
                "pii_filtering": self.pii_filtering,
            },
        }

    def check_tdm_opt_out_compliance(self) -> list[str]:
        """Identify sources where TDM opt-out was not checked."""
        non_compliant = []
        for source in self.sources:
            if source.collection_method == "scraped" and not source.tdm_opt_out_checked:
                non_compliant.append(
                    f"Art.52(1)(a)(i)/(ii): {source.source_name} — scraped source, TDM opt-out not verified"
                )
        return non_compliant
```
CopyrightCompliancePolicy
from enum import Enum
class TDMLegalBasis(Enum):
    ART3_DSMD_RESEARCH = "art3_dsmd_research_exception"  # Art.3 DSMD — mandatory research exception
    ART4_DSMD_GENERAL = "art4_dsmd_general_exception"    # Art.4 DSMD — general TDM exception (opt-out possible)
    LICENSED = "licensed"                                # Content obtained under licence
    PUBLIC_DOMAIN = "public_domain"                      # Copyright expired or CC0
    OPEN_LICENCE = "open_licence"                        # CC-BY, CC-BY-SA, Apache, etc.
    PROPRIETARY = "proprietary"                          # Provider-created content


@dataclass
class CopyrightCompliancePolicy:
    """Art.52(1)(a)(ii) — Copyright compliance policy for GPAI training data."""

    provider_name: str
    model_name: str
    policy_version: str
    effective_date: str

    # Legal bases used in training data collection
    legal_bases_used: list[TDMLegalBasis]

    # TDM opt-out detection
    tdm_opt_out_detection_method: str  # e.g., "robots.txt + tdmrep.json automated check"
    tdm_opt_out_update_frequency: str  # e.g., "weekly recrawl, retroactive removal on detection"

    # Licence compliance
    licence_categories: list[str]      # Types of licences held (not specific licences — trade secret)
    licence_audit_frequency: str       # How often licences are reviewed for compliance

    # Dispute resolution
    infringement_report_contact: str   # Email or API endpoint for copyright claims
    response_sla: str                  # Response time for infringement reports
    removal_procedure: str             # How content is removed from training data on valid claim

    # Ongoing monitoring
    new_optout_monitoring: bool = True
    new_licence_monitoring: bool = True
    retroactive_compliance_review: bool = False

    def validate(self) -> list[str]:
        """Return compliance gaps in the copyright policy."""
        gaps = []
        if TDMLegalBasis.ART4_DSMD_GENERAL in self.legal_bases_used:
            if not self.tdm_opt_out_detection_method:
                gaps.append("Art.52(1)(a)(ii): Art.4 DSMD TDM exception used but no opt-out detection method documented")
        if not self.infringement_report_contact:
            gaps.append("Art.52(1)(a)(ii): No infringement report contact documented")
        if not self.removal_procedure:
            gaps.append("Art.52(1)(a)(ii): No content removal procedure documented")
        return gaps

    def to_public_summary(self) -> dict:
        """Generate a public-facing copyright compliance summary."""
        return {
            "provider": self.provider_name,
            "model": self.model_name,
            "policy_version": self.policy_version,
            "effective_date": self.effective_date,
            "legal_bases": [lb.value for lb in self.legal_bases_used],
            "tdm_opt_out_respected": TDMLegalBasis.ART4_DSMD_GENERAL in self.legal_bases_used,
            "opt_out_detection": self.tdm_opt_out_detection_method,
            "opt_out_update_frequency": self.tdm_opt_out_update_frequency,
            "infringement_contact": self.infringement_report_contact,
            "response_sla": self.response_sla,
        }
Art.52 Compliance Checklist (40 Items)
Technical Documentation — Art.52(1)(a)
- Art.52-1 — Technical documentation drawn up before market placement, covering all Annex XI elements
- Art.52-2 — Model architecture type and key parameters documented (modality, architecture class, context window)
- Art.52-3 — Training methodology documented (pre-training approach, alignment procedures, fine-tuning steps)
- Art.52-4 — Intended purposes documented with specific supported use cases
- Art.52-5 — Capabilities documented with benchmark results on domain-relevant evaluations
- Art.52-6 — Limitations documented including known failure modes, hallucination characteristics, and known biases
- Art.52-7 — Performance on safety-relevant benchmarks documented (e.g., TruthfulQA, HarmBench, MMLU)
- Art.52-8 — Documentation update procedure defined — triggering events for revision identified
- Art.52-9 — Documentation version control in place — previous versions retained and auditable
- Art.52-10 — Documentation accessible for Commission review (Art.52(2)) — production-ready format
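A completeness gate over the documentation record can automate items Art.52-1 through Art.52-7. The element keys below are illustrative only, not the official Annex XI enumeration:

```python
# Illustrative element keys — not the official Annex XI enumeration.
REQUIRED_DOC_ELEMENTS = {
    "architecture",          # Art.52-2: model architecture and key parameters
    "training_methodology",  # Art.52-3: pre-training, alignment, fine-tuning
    "intended_purposes",     # Art.52-4: supported use cases
    "capabilities",          # Art.52-5: benchmark results
    "limitations",           # Art.52-6: failure modes, biases
    "safety_benchmarks",     # Art.52-7: safety-relevant evaluations
}


def missing_doc_elements(doc: dict) -> set[str]:
    """Return required elements that are absent or empty in the documentation record."""
    return REQUIRED_DOC_ELEMENTS - {key for key, value in doc.items() if value}


doc = {
    "architecture": "decoder-only transformer, 32k context",
    "capabilities": ["MMLU: 71.2"],
    "limitations": "",  # present but empty — still counts as a gap
}
gaps = missing_doc_elements(doc)
```

Running such a check in CI before each release supports the update and version-control items (Art.52-8, Art.52-9).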
Training Data Transparency — Art.52(1)(a)(i)
- Art.52-11 — Training data types documented — all major data categories identified
- Art.52-12 — Training data sources documented at category level (web, books, code, synthetic, licensed, etc.)
- Art.52-13 — Geographic coverage documented — languages and countries of origin identified
- Art.52-14 — Training data quality assessment documented — filtering and deduplication methods described
- Art.52-15 — Training data collection period documented — from/to dates for each major source
- Art.52-16 — Synthetic data components identified and described separately
- Art.52-17 — PII filtering procedures documented (GDPR Art.6 lawful basis for any personal data)
- Art.52-18 — Toxicity and NSFW filtering methods documented
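The category-level records required by items Art.52-11 through Art.52-16 can be kept as structured entries and aggregated into a transparency summary. A sketch with hypothetical field names and sample data:

```python
from dataclasses import dataclass


@dataclass
class TrainingDataSource:
    """One category-level entry in the Art.52(1)(a)(i) training data record."""
    category: str            # e.g. "web", "books", "code", "licensed", "synthetic"
    collection_start: str    # ISO date — start of collection period (Art.52-15)
    collection_end: str      # ISO date — end of collection period
    languages: list[str]     # language coverage (Art.52-13)
    synthetic: bool = False  # synthetic components flagged separately (Art.52-16)


def transparency_summary(sources: list[TrainingDataSource]) -> dict:
    """Aggregate category-level sources into a public transparency summary."""
    return {
        "categories": sorted({s.category for s in sources}),
        "languages": sorted({lang for s in sources for lang in s.languages}),
        "collection_period": (
            min(s.collection_start for s in sources),
            max(s.collection_end for s in sources),
        ),
        "includes_synthetic_data": any(s.synthetic for s in sources),
    }


sources = [
    TrainingDataSource("web", "2023-01-01", "2025-06-30", ["en", "de", "fr"]),
    TrainingDataSource("synthetic", "2025-01-01", "2025-07-31", ["en"], synthetic=True),
]
summary = transparency_summary(sources)
```

Keeping the summary derived from per-source records, rather than hand-written, makes it reproducible when the Commission requests documentation under Art.52(2).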
Copyright Compliance — Art.52(1)(a)(ii)
- Art.52-19 — Copyright compliance policy drafted and approved
- Art.52-20 — Legal basis for each training data source identified (DSMD Art.3/Art.4, licence, public domain)
- Art.52-21 — TDM opt-out detection method documented (robots.txt, tdmrep.json, other machine-readable signals)
- Art.52-22 — TDM opt-out update frequency defined — how often scraped sources are rechecked
- Art.52-23 — Retroactive removal procedure defined — process for removing content when opt-out detected post-training
- Art.52-24 — Licence inventory maintained — all licensed content categories and licences held
- Art.52-25 — Licence compliance audit frequency defined
- Art.52-26 — Copyright infringement claim contact point established and operational
- Art.52-27 — Response SLA for infringement claims defined (e.g., 30-day response)
- Art.52-28 — Training data summary for rights holders prepared — detailed enough for rights holders to identify their works
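Items Art.52-26 and Art.52-27 imply an operational claim-intake process with deadline tracking. A minimal sketch, assuming a 30-day default SLA (the claim fields and SLA value are illustrative, not mandated by the Act):

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class InfringementClaim:
    """An incoming copyright claim logged against the Art.52-26 contact point."""
    claimant: str
    work_description: str
    received: date


def response_due(claim: InfringementClaim, sla_days: int = 30) -> date:
    """Art.52-27: deadline for a first response under the provider's stated SLA."""
    return claim.received + timedelta(days=sla_days)


claim = InfringementClaim("Example Rights Holder", "novel excerpt", date(2026, 1, 1))
```

Logging the received date and computed deadline gives an auditable record that the stated SLA is actually met.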
Machine-Readable Model Card — Art.52(1)(b)
- Art.52-29 — Model card drawn up in machine-readable format (JSON, YAML, or schema-compliant format)
- Art.52-30 — Model card includes Art.51 classification status (general GPAI or systemic risk)
- Art.52-31 — Model card includes intended purposes, supported modalities, and language coverage
- Art.52-32 — Model card includes capabilities summary and benchmark results references
- Art.52-33 — Model card includes limitations and known failure modes
- Art.52-34 — Model card references training data summary (Art.52(3)) and copyright policy (Art.52(1)(a)(ii))
- Art.52-35 — Model card includes downstream Art.55 obligations for API integrators
- Art.52-36 — Model card version-controlled — downstream providers notified on material update
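Art.52(1)(b) requires the model card in a machine-readable format, but prescribes no fixed schema. One possible JSON shape covering items Art.52-29 through Art.52-36, with hypothetical field names and example values:

```python
import json

# Hypothetical field names — the Act does not prescribe a fixed JSON schema.
REQUIRED_CARD_FIELDS = {
    "model_name", "classification", "intended_purposes",
    "modalities", "languages", "limitations",
    "training_data_summary_ref", "copyright_policy_ref", "card_version",
}


def validate_model_card(card: dict) -> set[str]:
    """Return required model card fields that are missing."""
    return REQUIRED_CARD_FIELDS - card.keys()


card = {
    "model_name": "example-model",                # hypothetical model
    "classification": "general_gpai",             # Art.51 status (Art.52-30)
    "intended_purposes": ["text generation"],     # Art.52-31
    "modalities": ["text"],
    "languages": ["en", "de"],
    "limitations": ["may hallucinate citations"], # Art.52-33
    "training_data_summary_ref": "https://example.com/summary",    # hypothetical URL
    "copyright_policy_ref": "https://example.com/copyright",       # hypothetical URL
    "card_version": "1.0.0",                      # Art.52-36
}
machine_readable = json.dumps(card, indent=2)     # machine-readable serialisation (Art.52-29)
```

Serving the serialised card from a versioned endpoint lets downstream Art.55 integrators diff material updates automatically.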
Commission Access and Public Summary — Art.52(2)/(3)
- Art.52-37 — Documentation production procedure defined for Art.52(2) Commission requests
- Art.52-38 — Response timeline for Commission documentation requests defined and feasible
- Art.52-39 — Public training data summary registered in EU AI database (Art.71) via AI Office portal
- Art.52-40 — Public summary update procedure defined — triggered by new major model version
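The Art.52-40 trigger ("new major model version") can be enforced mechanically. A sketch assuming semantic-style version strings (the versioning convention is an assumption, not something the Act mandates):

```python
def _major(version: str) -> str:
    """Extract the major component from a semantic-style version string."""
    return version.split(".", 1)[0]


def summary_update_required(model_version: str, summary_version: str) -> bool:
    """Art.52-40: treat a new major model version as the public summary update trigger."""
    return _major(model_version) != _major(summary_version)
```

Wiring this check into the release pipeline ensures the public training data summary cannot silently lag a major model release.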
See Also
- EU AI Act Art.51 GPAI Model Classification: Systemic Risk Threshold and Provider Obligations — Art.51 is the classification gateway that determines which tier of Art.52 obligations applies to your GPAI model
- EU AI Act Art.29 GPAI Provider Obligations: Developer Guide — Art.29 maps the downstream information obligations that Art.52 technical documentation and model cards flow into via Art.55
- EU AI Act 2026: Conformity Assessment Guide for PaaS and SaaS Developers — Art.52 technical documentation forms part of the conformity assessment record when GPAI models are integrated into high-risk AI systems
- EU AI Act Art.11 Technical Documentation: Developer Guide (High-Risk AI) — Art.11 high-risk AI technical documentation requirements run in parallel with Art.52 for GPAI models integrated into high-risk AI systems
- EU AI Liability Directive (AILD) + Product Liability Directive 2024: Developer Guide — Art.52 training data transparency and copyright compliance records directly affect AILD evidentiary obligations in infringement claims