EU AI Act Art.52 GPAI Model General Obligations: Technical Documentation, Training Data & Copyright — Developer Guide (2026)
EU AI Act Chapter V imposes a tiered obligation structure on General-Purpose AI (GPAI) model providers. Article 51 classifies GPAI models into two tiers based on training compute — general GPAI models and GPAI models with systemic risk. Article 52 establishes the baseline obligations that apply to every GPAI model provider regardless of tier. These are not optional enhancements: they are the minimum compliance floor for any organisation that trains, fine-tunes, or distributes a GPAI model in or into the EU market.
Chapter V of the EU AI Act became applicable on 2 August 2025, a year ahead of the Act's general application date of 2 August 2026. Art.52 obligations are therefore already in force for GPAI model providers. If you train a foundation model, a large language model, a multimodal model, or any other model meeting the GPAI definition, Art.52 compliance is not a future concern; it is a present legal obligation.
For downstream SaaS developers integrating GPAI APIs into products, Art.52 is the regulatory basis for the technical documentation and model cards that upstream GPAI providers must supply. Art.55 carries Art.52's documentation obligations downstream: your GPAI API provider must give you the model card and training data summary that Art.52 requires them to maintain. Understanding Art.52 clarifies what you are contractually entitled to demand from your GPAI provider and what obligations attach to you once that documentation is supplied.
Art.52 in the EU AI Act Structure
Art.52 is the second article of Chapter V — General-Purpose AI Models (Art.51–56). It provides the compliance baseline: the obligations that apply regardless of systemic risk classification.
| Article | Title | Who It Applies To |
|---|---|---|
| Art.51 | GPAI model classification | Defines who is a GPAI provider and which tier |
| Art.52 | General GPAI model obligations | All GPAI model providers — both tiers |
| Art.53 | Systemic risk GPAI obligations | Systemic risk tier only (adversarial testing, incident reporting, cybersecurity) |
| Art.54 | Authorised representative | Non-EU systemic risk providers only |
| Art.55 | Downstream provider obligations | All GPAI providers — obligations to downstream integrators |
| Art.56 | Code of practice | Systemic risk tier — compliance pathway |
Art.52 provides six baseline obligations:
- Art.52(1)(a) — Draw up, keep up-to-date, and make available to the Commission technical documentation covering model architecture, training methodology, purposes, and capabilities and limitations
- Art.52(1)(a)(i) — Include in technical documentation training data transparency covering the types and sources of training data and, where applicable, geographic coverage and quality assessment
- Art.52(1)(a)(ii) — Include in technical documentation a copyright compliance policy and a summary of the content used for training
- Art.52(1)(b) — Draw up and make available to downstream providers an information document (model card) in machine-readable format
- Art.52(2) — Upon request, provide technical documentation to the Commission and national authorities
- Art.52(3) — Register and publish a summary of training data for public transparency purposes (in the EU database under Art.71)
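The six obligations above lend themselves to a simple compliance checklist. The sketch below is illustrative: the enum member names and the `outstanding_obligations` helper are our own labels for the provisions listed above, not terms from the Act.

```python
from enum import Enum

class Art52Obligation(Enum):
    """Illustrative labels for the six Art.52 baseline obligations."""
    TECHNICAL_DOCUMENTATION = "art52_1a_technical_documentation"
    TRAINING_DATA_TRANSPARENCY = "art52_1a_i_training_data_transparency"
    COPYRIGHT_POLICY = "art52_1a_ii_copyright_policy"
    MODEL_CARD = "art52_1b_model_card"
    COMMISSION_ACCESS = "art52_2_commission_access"
    PUBLIC_SUMMARY = "art52_3_public_summary"

def outstanding_obligations(completed: set[Art52Obligation]) -> list[Art52Obligation]:
    """Return the baseline obligations not yet evidenced as complete."""
    return [o for o in Art52Obligation if o not in completed]
```

A provider that has only published its model card would see the remaining five obligations returned, which makes the checklist useful as a gating step before market placement.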
Art.52(1)(a): Technical Documentation for GPAI Models
Art.52(1)(a) requires every GPAI provider to draw up and maintain technical documentation before placing the model on the EU market. This documentation is distinct from and additional to the Annex IV technical documentation required for high-risk AI systems — it is GPAI-specific and governed by Annex XI.
Annex XI mandatory content elements for GPAI technical documentation:
| Element | Mandatory Content | Practical Implication |
|---|---|---|
| Model architecture | Type of architecture, number of parameters, context window size, modality (text/image/audio/code/multimodal) | Must be documented in the technical file before market placement |
| Training methodology | Pre-training approach (self-supervised / masked language modelling / RLHF / Constitutional AI), fine-tuning steps, alignment procedures | The full training pipeline must be described — not just the final training run |
| Intended purposes | Tasks the model is designed to perform, target use cases, deployment scenarios described in the provider's documentation | The documentation must reflect what the model is marketed and deployed for |
| Capabilities | Performance on standard benchmarks relevant to the model's domain, demonstrated capabilities across task categories | Must be current — if capabilities are updated, the documentation must be updated |
| Limitations | Known failure modes, hallucination rates, bias and fairness assessments, safety benchmarks, capability limitations | Limitations documentation is mandatory, not optional transparency |
| Performance evaluation | Results of internal testing and third-party evaluations, including adversarial probes for general GPAI models | For systemic risk models, this is enhanced by Art.53(1)(b) adversarial testing requirements |
Documentation update obligation:
Art.52(1) requires that documentation is "kept up-to-date." This creates a continuous obligation — not just a one-time pre-market filing. Significant updates trigger documentation revision requirements:
| Update Type | Documentation Obligation |
|---|---|
| New pre-training run (continued training) | Update training methodology + cumulative FLOPs + new capability assessments |
| Fine-tuning update released publicly | Update intended purposes, capabilities, and limitations sections |
| Safety patch or alignment update | Update training methodology and limitations sections |
| New modality added (e.g., adding image understanding to text model) | Full documentation revision as new capabilities are added |
| Benchmark performance change | Update capabilities and performance evaluation sections |
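The trigger table above can be encoded as a lookup so documentation tooling flags the sections needing revision after each model update. Event keys and section names below are illustrative shorthand for the table rows, not terminology from the Act; the fallback for unknown events is a design choice, not a legal requirement.

```python
# Illustrative mapping of model-update events to the technical-documentation
# sections that need revision under the "kept up-to-date" obligation.
UPDATE_TRIGGERS: dict[str, list[str]] = {
    "continued_pretraining": ["training_methodology", "cumulative_flops", "capabilities"],
    "public_finetune_release": ["intended_purposes", "capabilities", "limitations"],
    "safety_or_alignment_patch": ["training_methodology", "limitations"],
    "new_modality": ["architecture", "training_methodology", "intended_purposes",
                     "capabilities", "limitations", "performance_evaluation"],
    "benchmark_change": ["capabilities", "performance_evaluation"],
}

def sections_to_revise(update_type: str) -> list[str]:
    """Look up which documentation sections an update invalidates."""
    try:
        return UPDATE_TRIGGERS[update_type]
    except KeyError:
        # Unknown events default to a full review rather than silently passing.
        return ["full_documentation_review"]
```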
Art.52(1)(a)(i): Training Data Transparency
Art.52(1)(a)(i) requires that the technical documentation includes transparency about training data. This is one of the most commercially sensitive provisions of Art.52 because it requires disclosing information that model providers have historically kept confidential for competitive reasons.
Mandatory training data disclosure elements:
| Element | What Must Be Disclosed | Notes |
|---|---|---|
| Types of data | Categories of data used (web text, books, code, scientific papers, conversation data, images, audio, synthetic data, etc.) | Type-level disclosure — not specific dataset names in all cases |
| Sources of data | Origin of training datasets (Common Crawl, licensed datasets, proprietary data collection, partnerships) | Source-level disclosure — includes whether data was scraped, licensed, or created |
| Geographic coverage | Languages represented, countries of origin of content creators, geographic distribution of training corpus | Relevant for assessing cultural bias and multilingual capability representation |
| Quality assessment | Filtering procedures applied, deduplication methods, quality scoring mechanisms, NSFW/toxicity filtering | Demonstrates diligence in training data curation — relevant to EU copyright and fundamental rights compliance |
Sensitive disclosures — proportionality principle:
Art.52(1)(a)(i) requires disclosure of training data information to the Commission (Art.52(2)) and in summary form publicly (Art.52(3)), but trade secret protection applies to the confidential portions. The regulation recognises a tension:
| Information Category | Disclosure to Commission | Public Summary (Art.52(3)) |
|---|---|---|
| Training data types | Full disclosure | Required summary |
| Training data sources (general) | Full disclosure | Required summary |
| Specific proprietary dataset names and compositions | Subject to trade secret protection | Aggregate categories only |
| Dataset licensing terms and costs | Trade secret protection applies | Not required in public summary |
| Synthetic data generation methodology | Full disclosure | Summary |
| Geographic distribution statistics | Full disclosure | Required summary |
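The Commission-versus-public split in the table above can be modelled by tagging each disclosure item with a trade-secret flag. This is a minimal sketch; the item shape (`name`, `value`, `trade_secret`) is our own convention, and real tooling would also substitute aggregate categories for withheld items rather than simply omitting them.

```python
def split_disclosure(items: list[dict]) -> tuple[list[dict], list[dict]]:
    """Partition disclosure items into (commission_view, public_view).

    The Commission receives everything; the public Art.52(3) summary
    omits items flagged as trade secrets.
    """
    commission = list(items)
    public = [i for i in items if not i.get("trade_secret", False)]
    return commission, public
```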
Bias and fairness dimension of training data transparency:
The geographic coverage and quality assessment requirements implicitly address bias. A GPAI provider must document:
- Which languages are over- or under-represented in training data
- Which geographic regions' web content was included or excluded
- Whether demographic or cultural bias assessments were conducted on the training corpus
This creates an audit trail that regulators — and downstream providers claiming Art.55 rights — can use to evaluate whether the model's capabilities and limitations documentation accurately reflects its training data composition.
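A first-pass language representation audit of the kind described above can be automated over the training corpus statistics. The sketch below assumes token counts per BCP 47 language code; the 1% threshold is an arbitrary illustration, not a legal test of under-representation.

```python
def flag_language_imbalance(token_counts: dict[str, int],
                            min_share: float = 0.01) -> list[str]:
    """Flag languages whose share of training tokens falls below min_share.

    token_counts maps BCP 47 language codes to approximate token counts.
    Returns flagged codes in sorted order for a stable audit trail.
    """
    total = sum(token_counts.values())
    if total == 0:
        return []
    return [lang for lang, n in sorted(token_counts.items())
            if n / total < min_share]
```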
Art.52(1)(a)(ii): Copyright Compliance Policy and Training Data Summary
Art.52(1)(a)(ii) is one of the most legally consequential provisions of Art.52. It requires every GPAI provider to:
- Have and document a copyright compliance policy
- Prepare a summary of the content used for training that enables rights holders to exercise their rights
Why Art.52(1)(a)(ii) exists:
The provision intersects with the EU Digital Single Market Directive (DSMD) 2019/790, which created the Text and Data Mining (TDM) exception at Articles 3 and 4. The DSMD framework:
| DSMD Provision | What It Does | Art.52(1)(a)(ii) Connection |
|---|---|---|
| Art.3 DSMD | Mandatory TDM exception for research organisations — cannot be overridden by contract | Research TDM use is presumptively lawful; Art.52 documentation confirms compliance |
| Art.4 DSMD | General TDM exception — but rights holders can opt out using machine-readable reservation | Art.52(1)(a)(ii) requires providers to document how they handle opt-outs |
| Art.4(3) DSMD | Rights holders may reserve TDM rights using machine-readable means | The copyright compliance policy must address how reserved rights are detected and respected |
Copyright compliance policy — mandatory content:
| Policy Element | Required Content |
|---|---|
| TDM opt-out detection | How the provider checks for robots.txt TDM opt-outs, tdmrep.json entries, and other machine-readable reservations |
| Licensed content | Categories of content acquired under licence agreements and the licences held |
| Public domain and open licence content | How CC0, CC-BY, CC-BY-SA, and similar open-licensed content is handled |
| Lawful web scraping | Legal basis for scraping non-opted-out content under Art.4 DSMD TDM exception |
| Dispute resolution | Process for responding to copyright infringement claims from rights holders |
| Ongoing compliance monitoring | How the provider monitors for new TDM opt-outs and updates training pipelines |
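TDM opt-out detection can be sketched against the TDM Reservation Protocol (TDMRep), which publishes a JSON document at a well-known location (commonly `/.well-known/tdmrep.json`). The parser below assumes a simplified document shape — a JSON object or list of objects carrying a `tdm-reservation` field — and is not a full TDMRep implementation; fetching, `robots.txt` handling, and location matching are out of scope.

```python
import json

def tdm_reserved(tdmrep_text: str) -> bool:
    """Conservatively interpret a fetched tdmrep.json document.

    Any entry with tdm-reservation == 1 counts as a reservation, and an
    unparseable document is treated as reserved, erring on the side of
    the rights holder.
    """
    try:
        doc = json.loads(tdmrep_text)
    except json.JSONDecodeError:
        return True  # unparseable: assume reserved
    entries = doc if isinstance(doc, list) else [doc]
    return any(isinstance(e, dict) and e.get("tdm-reservation") == 1
               for e in entries)
```

Erring toward "reserved" on malformed input is a deliberate design choice: a false positive excludes some lawful content, while a false negative would mean training on reserved works.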
Training data summary for rights holders:
The "summary of content used for training" serves a specific purpose: it enables rights holders to identify whether their works were included in training data and to exercise their rights (including claims under DSMD and national copyright law). The summary must be:
- Detailed enough to allow plausible identification of works or categories of works
- Accessible to rights holders — this implies it must be findable and understandable without specialist knowledge
- Current — must be updated when new training data is added
EU-wide harmonisation: The EU AI Office is developing guidance on what constitutes a compliant copyright compliance policy and a sufficient training data summary. Providers should monitor AI Office publications at ai.ec.europa.eu for updated requirements.
Art.52(1)(b): Machine-Readable Model Card for Downstream Providers
Art.52(1)(b) requires every GPAI provider to draw up an information document — commonly called a model card — and make it available to downstream providers who integrate the GPAI model into their own AI systems. This is the primary mechanism by which Art.52 obligations flow downstream via Art.55.
Machine-readable format requirement:
The model card must be machine-readable, not just human-readable. This enables:
- Automated compliance checking by downstream developers integrating the model via API
- Integration into AI system registries and EU AI Act compliance tooling
- Version tracking and automated notification when upstream model documentation changes
Minimum model card content under Art.52(1)(b):
| Section | Content Required | Format Guidance |
|---|---|---|
| Model identity | Provider name, model name, version, release date, model type (GPAI / GPAI with systemic risk) | Stable identifiers; version-controlled |
| Art.51 classification | Tier classification (general GPAI or systemic risk), designated or threshold-based | Binary field + supporting evidence reference |
| Intended purposes | Approved use cases, supported languages, input/output modalities | List format; version-controlled |
| Capabilities | Performance summary across relevant benchmarks, demonstrated task competence | Benchmark name + score + version of benchmark |
| Limitations | Known failure modes, hallucination characteristics, bias dimensions, capability gaps | Structured list; severity-tagged |
| Training data summary | Summary of training data types, sources, geographic coverage — from Art.52(1)(a)(i) | Reference to full Art.52(3) public summary |
| Copyright compliance | Pointer to copyright compliance policy from Art.52(1)(a)(ii) | URL to policy document; version-tracked |
| Art.55 downstream obligations | What obligations downstream integrators have when integrating this model | Checklist format for downstream compliance |
| Update notification | How downstream providers will be notified of material model changes | Webhook URL or notification API endpoint |
| Contact point | GPAI provider contact for compliance inquiries from downstream providers | Email or API endpoint |
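Since the Act fixes no schema, a minimal machine-readable model card can be expressed as JSON with a required-field check mirroring the table above. The key names, URLs, and values below are illustrative assumptions, not a mandated format.

```python
import json

# Illustrative required top-level fields, mirroring the table above.
REQUIRED_CARD_FIELDS = [
    "model_identity", "art51_classification", "intended_purposes",
    "capabilities", "limitations", "training_data_summary_ref",
    "copyright_policy_ref", "downstream_obligations",
    "update_notification", "contact_point",
]

def validate_model_card(card_json: str) -> list[str]:
    """Return the required top-level fields missing from a model card."""
    card = json.loads(card_json)
    return [f for f in REQUIRED_CARD_FIELDS if f not in card]

# Hypothetical example card for a fictional provider.
card = json.dumps({
    "model_identity": {"provider": "ExampleAI", "model": "example-1", "version": "2.1.0"},
    "art51_classification": "general_gpai",
    "intended_purposes": ["text_generation"],
    "capabilities": {"mmlu": 0.71},
    "limitations": ["degraded_accuracy_in_low_resource_languages"],
    "training_data_summary_ref": "https://example.com/art52-3-summary",
    "copyright_policy_ref": "https://example.com/copyright-policy",
    "downstream_obligations": ["art55_documentation_passthrough"],
    "update_notification": "https://example.com/webhooks/model-updates",
    "contact_point": "compliance@example.com",
})
```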
Machine-readable format options:
The AI Act does not mandate a specific schema, but emerging standards include:
- Croissant (MLCommons format for ML datasets — applicable to training data documentation)
- JSON Schema-based model cards (Google model cards, Hugging Face model card format)
- SPDX (for training data licensing information)
- Schema.org extensions for AI systems
Providers should choose a format that can be consumed by compliance automation tools and is compatible with the EU AI Office's API-based registration system under Art.71.
Art.52(2): Commission Access Obligation
Art.52(2) grants the European Commission the right to request access to GPAI technical documentation. This is a reactive right (the Commission requests access) rather than a proactive submission obligation, but it has significant compliance implications.
When Art.52(2) is triggered:
| Scenario | Commission Action | Provider Response Obligation |
|---|---|---|
| Routine compliance monitoring | Commission requests documentation review | Provider must provide within the specified timeframe |
| Systemic risk assessment | Commission is evaluating potential Art.51(2) designation of a model | Provider must provide detailed technical documentation to support or rebut assessment |
| Serious incident investigation | Commission or AI Office is investigating an incident involving the GPAI model | Priority access; provider must cooperate under Art.52(2) + Art.21 |
| GPAI code of practice review | Commission is assessing adequacy of code of practice measures | Technical documentation is central evidence |
Practical implications for compliance infrastructure:
Art.52(2) creates an ongoing obligation to maintain documentation in a form that can be produced on request. This means:
- Technical documentation must be version-controlled and auditable
- Previous versions must be retained — the Commission may request historical documentation to assess when a model crossed a capability or compute threshold
- Documentation must be in a form that can be shared with the Commission (translated into an EU language if required)
- Confidentiality protections (trade secrets) apply but cannot be used to refuse access — they affect the handling of disclosed information, not the disclosure obligation itself
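The retention requirements above can be met with an append-only, version-aware store, so that historical documentation remains producible on request. A minimal in-memory sketch; class and method names are our own, and persistence, translation, and access control are out of scope.

```python
from datetime import datetime, timezone
from typing import Optional

class VersionedDocumentationStore:
    """Append-only store for technical documentation versions.

    Previous versions are never overwritten, supporting Art.52(2)-style
    requests for historical documentation.
    """
    def __init__(self) -> None:
        self._versions: list[dict] = []

    def publish(self, version: str, content: dict) -> None:
        """Record a new documentation version with a UTC timestamp."""
        self._versions.append({
            "version": version,
            "content": content,
            "published_at": datetime.now(timezone.utc).isoformat(),
        })

    def current(self) -> Optional[dict]:
        return self._versions[-1] if self._versions else None

    def history(self) -> list[str]:
        return [v["version"] for v in self._versions]
```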
Art.52(2) vs. Art.21 (MSA cooperation for high-risk AI):
Art.52(2) applies to GPAI models specifically; Art.21 applies to high-risk AI systems. A GPAI model integrated into a high-risk AI system may be subject to both obligations simultaneously — the high-risk AI system integrator cooperates under Art.21, and the upstream GPAI provider cooperates under Art.52(2).
Art.52(3): Public Training Data Summary
Art.52(3) requires GPAI providers to register a summary of the content used for training in the EU AI database under Art.71 and make it publicly accessible. This is the public-facing obligation — distinct from the Commission-only documentation under Art.52(2).
What the public summary must contain:
| Element | Required in Public Summary | Confidentiality Protection |
|---|---|---|
| Training data types | Yes — all major categories | No protection — public interest disclosure |
| Training data sources (general) | Yes — at category level | Trade secrets on specific proprietary sources |
| Geographic coverage | Yes — by language and region | No protection |
| Quality filtering procedures | Summary description | Specific algorithms may have trade secret protection |
| Copyright compliance measures | Summary — "complies with EU copyright law via the following measures" | Policy document referenced; details protected |
| Data collection timeframe | Yes — training data vintage/cutoff | No protection |
| Opt-out respect measures | Yes — confirmation of TDM opt-out compliance | No protection |
Registration in the EU AI database:
Art.52(3) links to Art.71 — the EU AI Office maintains a central database of AI systems and GPAI models. GPAI providers must register their model and publish the training data summary through this database. The AI Office provides the registration interface and has published technical guidelines on submission format.
Update obligation:
If a new version of the GPAI model is trained on additional data, the public summary must be updated to reflect the extended training data. The EU AI database is version-aware — providers must submit updated summaries tied to each major model version.
Art.52 × Art.55: Downstream Information Flow Chain
Art.52 creates documentation obligations on GPAI providers; Art.55 transmits those obligations downstream to providers who integrate GPAI models into their AI systems. This creates a two-tier information chain:
| Layer | Actor | Obligation |
|---|---|---|
| Tier 1 (GPAI Provider) | Foundation model provider | Draws up Art.52 technical documentation, model card, copyright policy, training data summary |
| Tier 2 (Downstream Provider) | SaaS developer integrating GPAI API | Receives Art.52 model card under Art.55; uses it to populate their own Art.11/Annex IV technical documentation for any high-risk AI system built on the GPAI model |
| End User | Deployer or user of downstream AI product | Benefits from transparency chain; can exercise rights under Art.86 (right to explanation) based on information documented through Art.52 → Art.55 chain |
What downstream providers are entitled to receive under Art.55:
When you integrate a GPAI API into a product — particularly a high-risk AI system under Annex III — you are entitled to receive from your GPAI API provider:
- A copy of the machine-readable model card (Art.52(1)(b))
- A reference to the publicly available training data summary (Art.52(3))
- A reference to the copyright compliance policy (Art.52(1)(a)(ii))
- The Art.51 classification status of the underlying model (general GPAI or systemic risk)
- For systemic risk models: notification of classification change and related obligations
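A downstream integrator can check the entitlements listed above against what its provider has actually delivered. The artifact keys below are illustrative, chosen to match the hypothetical model-card shape used earlier in this guide.

```python
def missing_art55_artifacts(received: dict) -> list[str]:
    """List the Art.52-derived artifacts a downstream integrator has
    not yet received from its GPAI provider. Keys are illustrative."""
    expected = {
        "model_card": "machine-readable model card (Art.52(1)(b))",
        "training_data_summary_ref": "public training data summary (Art.52(3))",
        "copyright_policy_ref": "copyright compliance policy (Art.52(1)(a)(ii))",
        "art51_classification": "tier classification of the underlying model",
    }
    return [desc for key, desc in expected.items() if not received.get(key)]
```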
Contractual enforcement of Art.55 rights:
Art.55 does not create an automatic legal transfer — it must be implemented through contracts between GPAI providers and downstream developers. Downstream developers should:
- Include explicit Art.52/Art.55 model card provision obligations in API terms of service contracts
- Specify the required format (machine-readable) and update frequency
- Include provisions for notification of material model changes that affect the model card
CLOUD Act × Art.52 Technical Documentation
Art.52 creates documentation that is highly sensitive from a legal and competitive perspective — and that documentation is directly at risk from CLOUD Act compellability when stored on US cloud infrastructure.
At-risk documentation under Art.52:
| Documentation | Art.52 Requirement | CLOUD Act Risk |
|---|---|---|
| Cumulative training compute logs | Required for Commission notification and technical documentation | Compellable by US law enforcement — potential disclosure of proprietary training methodology |
| Training dataset composition | Art.52(1)(a)(i) — types and sources of data | Commercially sensitive; training data selection is a core competitive differentiator |
| Copyright compliance records | Art.52(1)(a)(ii) — policy and data usage summary | Litigation-relevant; copyright disputes use these records as evidence |
| Capability and limitations documentation | Art.52(1)(a) — benchmarks and known failures | Competitive intelligence; discloses safety vulnerabilities |
| Model card versions | Art.52(1)(b) — machine-readable model card history | Version history shows capability evolution; competitive intelligence |
| Training data opt-out compliance logs | Part of copyright compliance policy | Copyright litigation evidence; dual-access risk |
Dual-compellability scenario for GPAI providers:
A GPAI provider operating primarily on US cloud infrastructure (AWS, Azure, GCP) faces the following simultaneous obligations:
- EU Art.52(2): Commission can request access to technical documentation — to be provided to EU authorities
- CLOUD Act: US law enforcement can compel access to the same records stored on US infrastructure — without an EU court order
These two obligations can conflict: the GPAI provider is simultaneously obligated to protect commercially sensitive documentation for EU regulatory purposes and subject to compelled disclosure to US law enforcement.
EU-native infrastructure as structural mitigation:
Storing GPAI compliance documentation — technical files, training data transparency reports, copyright compliance records, model card versions — on EU-native infrastructure (incorporated in the EU, operating under GDPR and EU administrative law) removes direct CLOUD Act compellability. The records are subject to EU judicial process only, and access by non-EU authorities requires Mutual Legal Assistance Treaty (MLAT) procedures that provide legal visibility.
For developers building compliance infrastructure for GPAI providers — model registries, technical documentation management systems, copyright compliance tracking systems — EU-native PaaS deployment is an architectural requirement to achieve single-jurisdiction compliance.
Python Implementation
GPAITechnicalDocumentationRecord
```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional
from datetime import date


class GPAITier(Enum):
    GENERAL = "general_gpai"
    SYSTEMIC_RISK = "systemic_risk_gpai"


class DocumentationStatus(Enum):
    DRAFT = "draft"
    COMPLETE = "complete"
    UNDER_REVIEW = "under_review"
    APPROVED = "approved"


@dataclass
class GPAITechnicalDocumentationRecord:
    """Art.52(1)(a) — GPAI model technical documentation record."""
    model_name: str
    model_version: str
    provider_name: str
    tier: GPAITier
    architecture_type: str  # e.g., "transformer", "diffusion", "mixture-of-experts"
    parameter_count: Optional[int]  # May be confidential — None if not disclosed
    context_window: Optional[int]  # Token context window
    modalities: list[str]  # e.g., ["text", "image", "code"]
    training_methodology: str  # Pre-training approach description
    fine_tuning_steps: list[str]  # List of fine-tuning and alignment procedures
    intended_purposes: list[str]  # Documented intended use cases
    supported_languages: list[str]  # BCP 47 language codes
    capabilities_summary: str  # Narrative capabilities description
    limitations_summary: str  # Narrative limitations and known failure modes
    benchmark_results: dict[str, float] = field(default_factory=dict)  # name -> score
    safety_evaluations: list[str] = field(default_factory=list)
    cumulative_training_flops: Optional[float] = None  # 10^25 threshold reference
    documentation_version: str = "1.0.0"
    documentation_date: str = ""
    status: DocumentationStatus = DocumentationStatus.DRAFT

    def __post_init__(self):
        if not self.documentation_date:
            self.documentation_date = date.today().isoformat()

    def is_systemic_risk_threshold_met(self) -> bool:
        """Check if cumulative training compute exceeds Art.51 threshold."""
        if self.cumulative_training_flops is None:
            return False
        return self.cumulative_training_flops >= 1e25

    def validate_completeness(self) -> list[str]:
        """Return list of missing mandatory Art.52(1)(a) elements."""
        gaps = []
        if not self.architecture_type:
            gaps.append("Art.52(1)(a): Model architecture type not documented")
        if not self.training_methodology:
            gaps.append("Art.52(1)(a): Training methodology not documented")
        if not self.intended_purposes:
            gaps.append("Art.52(1)(a): Intended purposes not documented")
        if not self.capabilities_summary:
            gaps.append("Art.52(1)(a): Capabilities summary not documented")
        if not self.limitations_summary:
            gaps.append("Art.52(1)(a): Limitations summary not documented")
        if not self.benchmark_results:
            gaps.append("Art.52(1)(a): No benchmark results documented")
        if not self.safety_evaluations:
            gaps.append("Art.52(1)(a): No safety evaluation results documented")
        return gaps

    def to_commission_submission(self) -> dict:
        """Prepare record for Art.52(2) Commission access submission."""
        return {
            "provider": self.provider_name,
            "model": f"{self.model_name} v{self.model_version}",
            "tier": self.tier.value,
            "architecture": self.architecture_type,
            "modalities": self.modalities,
            "methodology": self.training_methodology,
            "purposes": self.intended_purposes,
            "capabilities": self.capabilities_summary,
            "limitations": self.limitations_summary,
            "benchmarks": self.benchmark_results,
            "safety": self.safety_evaluations,
            "compute": self.cumulative_training_flops,
            "version": self.documentation_version,
            "date": self.documentation_date,
        }
```
TrainingDataTransparencyReport
```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TrainingDataSource:
    """A single training data source entry."""
    source_name: str  # e.g., "Common Crawl", "Books3", "GitHub"
    data_type: str  # e.g., "web_text", "books", "code", "scientific_papers"
    collection_method: str  # "scraped", "licensed", "proprietary", "synthetic"
    approximate_size_tokens: Optional[int] = None
    languages: list[str] = field(default_factory=list)
    geographic_coverage: list[str] = field(default_factory=list)  # ISO 3166-1 alpha-2 codes
    collection_period_start: Optional[str] = None  # ISO date
    collection_period_end: Optional[str] = None  # ISO date
    tdm_opt_out_checked: bool = False
    licence_held: Optional[str] = None  # SPDX identifier or "proprietary"
    is_confidential: bool = False  # Trade secret protection flag


@dataclass
class TrainingDataTransparencyReport:
    """Art.52(1)(a)(i) — Training data transparency report."""
    model_name: str
    model_version: str
    report_date: str
    sources: list[TrainingDataSource]
    total_training_tokens: Optional[int] = None
    quality_filtering_methods: list[str] = field(default_factory=list)
    deduplication_methods: list[str] = field(default_factory=list)
    toxicity_filtering: bool = False
    nsfw_filtering: bool = False
    pii_filtering: bool = False

    def get_geographic_distribution(self) -> dict[str, int]:
        """Aggregate geographic coverage across all sources."""
        distribution: dict[str, int] = {}
        for source in self.sources:
            for country in source.geographic_coverage:
                distribution[country] = distribution.get(country, 0) + 1
        return distribution

    def get_language_distribution(self) -> dict[str, int]:
        """Aggregate language coverage across all sources."""
        distribution: dict[str, int] = {}
        for source in self.sources:
            for lang in source.languages:
                distribution[lang] = distribution.get(lang, 0) + 1
        return distribution

    def get_data_type_distribution(self) -> dict[str, list[str]]:
        """Group sources by data type."""
        by_type: dict[str, list[str]] = {}
        for source in self.sources:
            if source.data_type not in by_type:
                by_type[source.data_type] = []
            by_type[source.data_type].append(source.source_name)
        return by_type

    def generate_public_summary(self) -> dict:
        """Generate Art.52(3) public summary — excludes confidential source details."""
        public_sources = []
        for source in self.sources:
            if not source.is_confidential:
                public_sources.append({
                    "type": source.data_type,
                    "collection_method": source.collection_method,
                    "languages": source.languages,
                    "geographic_coverage": source.geographic_coverage,
                    "period": f"{source.collection_period_start} — {source.collection_period_end}",
                    "tdm_opt_out_respected": source.tdm_opt_out_checked,
                })
        return {
            "model": f"{self.model_name} v{self.model_version}",
            "report_date": self.report_date,
            "total_tokens": self.total_training_tokens,
            "data_sources_count": len(self.sources),
            "public_sources": public_sources,
            "geographic_distribution": self.get_geographic_distribution(),
            "language_distribution": self.get_language_distribution(),
            "quality_measures": {
                "filtering": self.quality_filtering_methods,
                "deduplication": self.deduplication_methods,
                "toxicity_filtering": self.toxicity_filtering,
                "nsfw_filtering": self.nsfw_filtering,
                "pii_filtering": self.pii_filtering,
            },
        }

    def check_tdm_opt_out_compliance(self) -> list[str]:
        """Identify sources where TDM opt-out was not checked."""
        non_compliant = []
        for source in self.sources:
            if source.collection_method == "scraped" and not source.tdm_opt_out_checked:
                non_compliant.append(
                    f"Art.52(1)(a)(i)/(ii): {source.source_name} — scraped source, TDM opt-out not verified"
                )
        return non_compliant
```
CopyrightCompliancePolicy
from enum import Enum
class TDMLegalBasis(Enum):
    ART3_DSMD_RESEARCH = "art3_dsmd_research_exception"  # Art.3 DSMD — mandatory research exception
    ART4_DSMD_GENERAL = "art4_dsmd_general_exception"    # Art.4 DSMD — general TDM exception (opt-out possible)
    LICENSED = "licensed"                                # Content obtained under licence
    PUBLIC_DOMAIN = "public_domain"                      # Copyright expired or CC0
    OPEN_LICENCE = "open_licence"                        # CC-BY, CC-BY-SA, Apache, etc.
    PROPRIETARY = "proprietary"                          # Provider-created content


@dataclass
class CopyrightCompliancePolicy:
    """Art.52(1)(a)(ii) — Copyright compliance policy for GPAI training data."""

    provider_name: str
    model_name: str
    policy_version: str
    effective_date: str

    # Legal bases used in training data collection
    legal_bases_used: list[TDMLegalBasis]

    # TDM opt-out detection
    tdm_opt_out_detection_method: str  # e.g., "robots.txt + tdmrep.json automated check"
    tdm_opt_out_update_frequency: str  # e.g., "weekly recrawl, retroactive removal on detection"

    # Licence compliance
    licence_categories: list[str]      # Types of licences held (not specific licences — trade secret)
    licence_audit_frequency: str       # How often licences are reviewed for compliance

    # Dispute resolution
    infringement_report_contact: str   # Email or API endpoint for copyright claims
    response_sla: str                  # Response time for infringement reports
    removal_procedure: str             # How content is removed from training data on valid claim

    # Ongoing monitoring
    new_optout_monitoring: bool = True
    new_licence_monitoring: bool = True
    retroactive_compliance_review: bool = False

    def validate(self) -> list[str]:
        """Return compliance gaps in the copyright policy."""
        gaps = []
        if TDMLegalBasis.ART4_DSMD_GENERAL in self.legal_bases_used:
            if not self.tdm_opt_out_detection_method:
                gaps.append("Art.52(1)(a)(ii): Art.4 DSMD TDM exception used but no opt-out detection method documented")
        if not self.infringement_report_contact:
            gaps.append("Art.52(1)(a)(ii): No infringement report contact documented")
        if not self.removal_procedure:
            gaps.append("Art.52(1)(a)(ii): No content removal procedure documented")
        return gaps

    def to_public_summary(self) -> dict:
        """Generate a public-facing copyright compliance summary."""
        return {
            "provider": self.provider_name,
            "model": self.model_name,
            "policy_version": self.policy_version,
            "effective_date": self.effective_date,
            "legal_bases": [lb.value for lb in self.legal_bases_used],
            "tdm_opt_out_respected": TDMLegalBasis.ART4_DSMD_GENERAL in self.legal_bases_used,
            "opt_out_detection": self.tdm_opt_out_detection_method,
            "opt_out_update_frequency": self.tdm_opt_out_update_frequency,
            "infringement_contact": self.infringement_report_contact,
            "response_sla": self.response_sla,
        }
Art.52 Compliance Checklist (40 Items)
Technical Documentation — Art.52(1)(a)
- Art.52-1 — Technical documentation drawn up before market placement, covering all Annex XI elements
- Art.52-2 — Model architecture type and key parameters documented (modality, architecture class, context window)
- Art.52-3 — Training methodology documented (pre-training approach, alignment procedures, fine-tuning steps)
- Art.52-4 — Intended purposes documented with specific supported use cases
- Art.52-5 — Capabilities documented with benchmark results on domain-relevant evaluations
- Art.52-6 — Limitations documented including known failure modes, hallucination characteristics, and known biases
- Art.52-7 — Performance on safety-relevant benchmarks documented (e.g., TruthfulQA, HarmBench, MMLU)
- Art.52-8 — Documentation update procedure defined — triggering events for revision identified
- Art.52-9 — Documentation version control in place — previous versions retained and auditable
- Art.52-10 — Documentation accessible for Commission review (Art.52(2)) — production-ready format
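A completeness gate over the documentation record can automate items Art.52-1 through Art.52-7. The element keys below are illustrative only, not the official Annex XI enumeration:

```python
# Illustrative element keys — not the official Annex XI enumeration.
REQUIRED_DOC_ELEMENTS = {
    "architecture",          # Art.52-2: model architecture and key parameters
    "training_methodology",  # Art.52-3: pre-training, alignment, fine-tuning
    "intended_purposes",     # Art.52-4: supported use cases
    "capabilities",          # Art.52-5: benchmark results
    "limitations",           # Art.52-6: failure modes, biases
    "safety_benchmarks",     # Art.52-7: safety-relevant evaluations
}


def missing_doc_elements(doc: dict) -> set[str]:
    """Return required elements that are absent or empty in the documentation record."""
    return REQUIRED_DOC_ELEMENTS - {key for key, value in doc.items() if value}


doc = {
    "architecture": "decoder-only transformer, 32k context",
    "capabilities": ["MMLU: 71.2"],
    "limitations": "",  # present but empty — still counts as a gap
}
gaps = missing_doc_elements(doc)
```

Running such a check in CI before each release supports the update and version-control items (Art.52-8, Art.52-9).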
Training Data Transparency — Art.52(1)(a)(i)
- Art.52-11 — Training data types documented — all major data categories identified
- Art.52-12 — Training data sources documented at category level (web, books, code, synthetic, licensed, etc.)
- Art.52-13 — Geographic coverage documented — languages and countries of origin identified
- Art.52-14 — Training data quality assessment documented — filtering and deduplication methods described
- Art.52-15 — Training data collection period documented — from/to dates for each major source
- Art.52-16 — Synthetic data components identified and described separately
- Art.52-17 — PII filtering procedures documented (GDPR Art.6 lawful basis for any personal data)
- Art.52-18 — Toxicity and NSFW filtering methods documented
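The category-level records required by items Art.52-11 through Art.52-16 can be kept as structured entries and aggregated into a transparency summary. A sketch with hypothetical field names and sample data:

```python
from dataclasses import dataclass


@dataclass
class TrainingDataSource:
    """One category-level entry in the Art.52(1)(a)(i) training data record."""
    category: str            # e.g. "web", "books", "code", "licensed", "synthetic"
    collection_start: str    # ISO date — start of collection period (Art.52-15)
    collection_end: str      # ISO date — end of collection period
    languages: list[str]     # language coverage (Art.52-13)
    synthetic: bool = False  # synthetic components flagged separately (Art.52-16)


def transparency_summary(sources: list[TrainingDataSource]) -> dict:
    """Aggregate category-level sources into a public transparency summary."""
    return {
        "categories": sorted({s.category for s in sources}),
        "languages": sorted({lang for s in sources for lang in s.languages}),
        "collection_period": (
            min(s.collection_start for s in sources),
            max(s.collection_end for s in sources),
        ),
        "includes_synthetic_data": any(s.synthetic for s in sources),
    }


sources = [
    TrainingDataSource("web", "2023-01-01", "2025-06-30", ["en", "de", "fr"]),
    TrainingDataSource("synthetic", "2025-01-01", "2025-07-31", ["en"], synthetic=True),
]
summary = transparency_summary(sources)
```

Keeping the summary derived from per-source records, rather than hand-written, makes it reproducible when the Commission requests documentation under Art.52(2).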
Copyright Compliance — Art.52(1)(a)(ii)
- Art.52-19 — Copyright compliance policy drafted and approved
- Art.52-20 — Legal basis for each training data source identified (DSMD Art.3/Art.4, licence, public domain)
- Art.52-21 — TDM opt-out detection method documented (robots.txt, tdmrep.json, other machine-readable signals)
- Art.52-22 — TDM opt-out update frequency defined — how often scraped sources are rechecked
- Art.52-23 — Retroactive removal procedure defined — process for removing content when opt-out detected post-training
- Art.52-24 — Licence inventory maintained — all licensed content categories and licences held
- Art.52-25 — Licence compliance audit frequency defined
- Art.52-26 — Copyright infringement claim contact point established and operational
- Art.52-27 — Response SLA for infringement claims defined (e.g., 30-day response)
- Art.52-28 — Training data summary for rights holders prepared — detailed enough for rights holders to identify their works
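Items Art.52-26 and Art.52-27 imply an operational claim-intake process with deadline tracking. A minimal sketch, assuming a 30-day default SLA (the claim fields and SLA value are illustrative, not mandated by the Act):

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class InfringementClaim:
    """An incoming copyright claim logged against the Art.52-26 contact point."""
    claimant: str
    work_description: str
    received: date


def response_due(claim: InfringementClaim, sla_days: int = 30) -> date:
    """Art.52-27: deadline for a first response under the provider's stated SLA."""
    return claim.received + timedelta(days=sla_days)


claim = InfringementClaim("Example Rights Holder", "novel excerpt", date(2026, 1, 1))
```

Logging the received date and computed deadline gives an auditable record that the stated SLA is actually met.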
Machine-Readable Model Card — Art.52(1)(b)
- Art.52-29 — Model card drawn up in machine-readable format (JSON, YAML, or schema-compliant format)
- Art.52-30 — Model card includes Art.51 classification status (general GPAI or systemic risk)
- Art.52-31 — Model card includes intended purposes, supported modalities, and language coverage
- Art.52-32 — Model card includes capabilities summary and benchmark results references
- Art.52-33 — Model card includes limitations and known failure modes
- Art.52-34 — Model card references training data summary (Art.52(3)) and copyright policy (Art.52(1)(a)(ii))
- Art.52-35 — Model card includes downstream Art.55 obligations for API integrators
- Art.52-36 — Model card version-controlled — downstream providers notified on material update
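Art.52(1)(b) requires the model card in a machine-readable format, but prescribes no fixed schema. One possible JSON shape covering items Art.52-29 through Art.52-36, with hypothetical field names and example values:

```python
import json

# Hypothetical field names — the Act does not prescribe a fixed JSON schema.
REQUIRED_CARD_FIELDS = {
    "model_name", "classification", "intended_purposes",
    "modalities", "languages", "limitations",
    "training_data_summary_ref", "copyright_policy_ref", "card_version",
}


def validate_model_card(card: dict) -> set[str]:
    """Return required model card fields that are missing."""
    return REQUIRED_CARD_FIELDS - card.keys()


card = {
    "model_name": "example-model",                # hypothetical model
    "classification": "general_gpai",             # Art.51 status (Art.52-30)
    "intended_purposes": ["text generation"],     # Art.52-31
    "modalities": ["text"],
    "languages": ["en", "de"],
    "limitations": ["may hallucinate citations"], # Art.52-33
    "training_data_summary_ref": "https://example.com/summary",    # hypothetical URL
    "copyright_policy_ref": "https://example.com/copyright",       # hypothetical URL
    "card_version": "1.0.0",                      # Art.52-36
}
machine_readable = json.dumps(card, indent=2)     # machine-readable serialisation (Art.52-29)
```

Serving the serialised card from a versioned endpoint lets downstream Art.55 integrators diff material updates automatically.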
Commission Access and Public Summary — Art.52(2)/(3)
- Art.52-37 — Documentation production procedure defined for Art.52(2) Commission requests
- Art.52-38 — Response timeline for Commission documentation requests defined and feasible
- Art.52-39 — Public training data summary registered in EU AI database (Art.71) via AI Office portal
- Art.52-40 — Public summary update procedure defined — triggered by new major model version
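The Art.52-40 trigger ("new major model version") can be enforced mechanically. A sketch assuming semantic-style version strings (the versioning convention is an assumption, not something the Act mandates):

```python
def _major(version: str) -> str:
    """Extract the major component from a semantic-style version string."""
    return version.split(".", 1)[0]


def summary_update_required(model_version: str, summary_version: str) -> bool:
    """Art.52-40: treat a new major model version as the public summary update trigger."""
    return _major(model_version) != _major(summary_version)
```

Wiring this check into the release pipeline ensures the public training data summary cannot silently lag a major model release.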
See Also
- EU AI Act Art.51 GPAI Model Classification: Systemic Risk Threshold and Provider Obligations — Art.51 is the classification gateway that determines which tier of Art.52 obligations applies to your GPAI model
- EU AI Act Art.29 GPAI Provider Obligations: Developer Guide — Art.29 maps the downstream information obligations that Art.52 technical documentation and model cards flow into via Art.55
- EU AI Act 2026: Conformity Assessment Guide for PaaS and SaaS Developers — Art.52 technical documentation forms part of the conformity assessment record when GPAI models are integrated into high-risk AI systems
- EU AI Act Art.11 Technical Documentation: Developer Guide (High-Risk AI) — Art.11 high-risk AI technical documentation requirements run in parallel with Art.52 for GPAI models integrated into high-risk AI systems
- EU AI Liability Directive (AILD) + Product Liability Directive 2024: Developer Guide — Art.52 training data transparency and copyright compliance records directly affect AILD evidentiary obligations in infringement claims