GPAI Code of Practice Chapter 3: Adversarial Testing, Red-Teaming, and Incident Reporting for Systemic Risk AI (2026)
The GPAI Code of Practice, adopted by the EU AI Office in July 2025, has three chapters. Chapters 1 (Transparency) and 2 (Copyright) apply to all GPAI model providers. Chapter 3 (Safety & Security) applies only to Systemic Risk providers — those whose models exceed the 10^25 FLOPs training compute threshold defined in Art.51(1)(a) of the EU AI Act, or whose models are designated as Systemic Risk by the AI Office under Art.51(1)(b).
This post is a technical deep-dive into Chapter 3. Our earlier guide (GPAI Code of Practice Final: Implementation Guide) covered all three chapters at an overview level. This guide focuses exclusively on what Chapter 3 actually requires in practice: the ten Safety & Security measures (S-01 through S-10), how to run a compliant red-teaming program, what triggers incident reporting, and what the cybersecurity controls mean at the implementation level.
If your GPAI model is below the 10^25 FLOPs threshold and has not been individually designated by the AI Office, Chapter 3 does not apply to you. If it does apply, Chapter 3 is the most operationally demanding part of GPAI compliance.
The Systemic Risk Threshold: Art.51 and CoP Chapter 3 Scope
Before Chapter 3 obligations apply, the provider must determine whether their model qualifies as a Systemic Risk GPAI model.
The Threshold: 10^25 FLOPs
Art.51(1)(a) sets the primary trigger: a GPAI model is presumed to present systemic risk if it was trained using a total computing power of more than 10^25 floating point operations (FLOPs). This threshold was first established in the EU AI Act's recitals as an indicative figure for frontier-scale models and was codified in the final text.
The 10^25 FLOPs figure covers training compute across all training runs contributing to the final model — pre-training, fine-tuning on instruction-following data, and safety alignment fine-tuning all count toward the total. Compute figures are typically measured in GPU-hours and converted to FLOPs based on hardware specifications. GPAI CoP Chapter 1 (Transparency) requires providers to document FLOPs in the technical documentation for all GPAI models; for Systemic Risk threshold assessment, this documentation is also the primary evidence for self-assessment.
| Training Compute | Art.51 Status | Chapter 3 Applies? |
|---|---|---|
| < 10^25 FLOPs | Below threshold — no automatic Systemic Risk | No (unless Art.51(1)(b) designation) |
| ≥ 10^25 FLOPs | Presumed Systemic Risk under Art.51(1)(a) | Yes |
| Any compute | AI Office designation under Art.51(1)(b) | Yes if designated |
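The threshold arithmetic is simple in principle: a back-of-the-envelope self-assessment converts accelerator-hours to total training FLOPs via peak throughput and an assumed utilization factor. The hardware figures below are hypothetical placeholders, not values from the CoP:

```python
# Rough Art.51(1)(a) self-check. All hardware figures are hypothetical.
SYSTEMIC_RISK_THRESHOLD_FLOPS = 1e25

def estimated_training_flops(
    accelerator_count: int,
    wall_clock_hours: float,
    peak_flops_per_second: float,
    utilization: float,
) -> float:
    """Approximate cumulative training FLOPs for one training run."""
    return accelerator_count * wall_clock_hours * 3600 * peak_flops_per_second * utilization

# Hypothetical frontier run: 10,000 accelerators at ~1e15 FLOP/s peak,
# 40% utilization, 90 days. In a real assessment, all contributing runs
# (pre-training, fine-tuning, alignment) would be summed.
total = estimated_training_flops(10_000, 90 * 24, 1e15, 0.40)
print(total >= SYSTEMIC_RISK_THRESHOLD_FLOPS)  # True for these figures
```

A shorter or smaller run drops below the presumption; the documentation obligation from Chapter 1 is what makes this calculation auditable.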
AI Office Designation: Art.51(1)(b)
The AI Office can designate any GPAI model as Systemic Risk based on factors other than raw training compute. This pathway exists because training compute is not the only proxy for systemic risk — a model with 5×10^24 FLOPs that achieves dangerous capability levels through exceptionally efficient training or architectural innovation can be designated. Art.51(1)(b) designation is based on capability evaluations, not self-reported compute.
The AI Office publishes the list of designated Systemic Risk GPAI models. Providers that cross the 10^25 FLOPs threshold must notify the AI Office and have 15 working days to submit a self-assessment. The CoP provides a standard format for this notification.
Annual Re-Evaluation
Systemic Risk designation is not permanent. As capability benchmarks evolve and new models enter the market, the AI Office conducts annual reviews. A model that was Systemic Risk may be de-listed; a model that was below threshold may be added. The Chapter 3 obligations track designation status, not just initial classification.
Chapter 3 Structure: Ten Safety & Security Measures
The GPAI CoP Chapter 3 is organized into ten Safety & Security measures grouped into three functional categories:
Pre-Deployment Adversarial Testing (S-01 to S-04): Testing that must occur before initial deployment and before material changes.
Incident Reporting (S-05 to S-07): The process and timeline for reporting serious incidents to the AI Office.
Cybersecurity Controls (S-08 to S-10): Continuous operational controls for model security.
| Measure | Category | Core Requirement |
|---|---|---|
| S-01 | Pre-Deployment | Define adversarial testing scope across five capability categories |
| S-02 | Pre-Deployment | Conduct red-teaming with independent third-party evaluators |
| S-03 | Pre-Deployment | Pre-deployment gate decision incorporating test results |
| S-04 | Pre-Deployment | Retesting triggers for material changes and new deployment contexts |
| S-05 | Incident Reporting | Serious incident definition and internal classification process |
| S-06 | Incident Reporting | 72-hour initial notification to AI Office GPAI Incident Portal |
| S-07 | Incident Reporting | 15-day root cause analysis report |
| S-08 | Cybersecurity | Prompt injection protection — input validation and output monitoring |
| S-09 | Cybersecurity | Model weight access control — physical and logical security |
| S-10 | Cybersecurity | Anomaly monitoring and behavioral drift detection |
Pre-Deployment Adversarial Testing (S-01 to S-04)
S-01: Adversarial Testing Scope
Measure S-01 requires providers to define and document the scope of adversarial testing before each deployment. The CoP specifies five capability categories that must always be included in the testing scope for Systemic Risk models:
Category 1 — CBRN Uplift (Chemical, Biological, Radiological, Nuclear) The model must be tested for its ability to provide uplift to actors seeking to synthesize, acquire, or deploy CBRN agents. "Uplift" means meaningfully increasing the capability of a non-expert actor beyond what is freely available in public literature. Testing uses structured threat scenarios (e.g., a red-teamer plays a chemistry graduate student attempting to synthesize a Schedule 1 substance using only the model's guidance). The test measures how much the model's responses narrow the capability gap between an expert and a non-expert in operational terms.
Category 2 — Cyberoffensive Capabilities The model must be tested for its ability to assist in creating novel malware, exploiting previously-unpatched vulnerabilities, or automating offensive cybersecurity operations at scale. This includes both direct code generation and step-by-step operational guidance that constitutes meaningful assistance beyond publicly documented techniques.
Category 3 — Critical Infrastructure Dependencies The model must be tested for its ability to generate accurate, operationally specific guidance for attacking or disrupting critical infrastructure — electrical grids, water treatment systems, financial clearing infrastructure, communications networks. Testing includes both direct attack guidance and scenario reconstruction where indirect guidance could be assembled into an attack plan.
Category 4 — Autonomous Goal-Seeking Behavior The model must be tested for its ability to pursue multi-step goals autonomously in agentic contexts — taking actions in the real world, acquiring resources, resisting shutdown, and deceiving operators or users about its activities. This category is specifically relevant for GPAI models deployed in agentic frameworks. Testing uses sandboxed environments with tool-calling capabilities and structured goal-seeking tasks.
Category 5 — Large-Scale Persuasion and Manipulation The model must be tested for its ability to generate highly effective personalized persuasive content at scale — targeted influence operations, coordinated disinformation campaigns, psychological manipulation at individual level. Testing includes both direct generation tasks and scaled deployment scenarios where the model's outputs could be used for mass influence operations.
For each category, the testing scope document must specify: the threat model, the test methodology, the pass/fail criteria, and the remediation pathway if test results indicate unacceptable risk levels.
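A minimal record type for the per-category scope entry might look like the following. This is an illustrative structure of our own devising; the CoP specifies the required content, not a machine-readable schema:

```python
from dataclasses import dataclass

@dataclass
class CategoryTestingScope:
    """One capability category's entry in the S-01 testing scope document.

    Field names are illustrative, not mandated by the CoP.
    """
    category: str             # e.g. "cbrn_uplift"
    threat_model: str         # who the adversary is, with what resources and goals
    methodology: str          # how the category will be tested
    pass_fail_criteria: str   # explicit threshold separating acceptable from not
    remediation_pathway: str  # what happens if results indicate unacceptable risk

    def is_complete(self) -> bool:
        """All four required elements must be non-empty before testing begins."""
        return all([self.threat_model, self.methodology,
                    self.pass_fail_criteria, self.remediation_pathway])
```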
S-02: Red-Teaming Methodology and Third-Party Evaluators
Measure S-02 governs who conducts the testing and how. The CoP requires a hybrid model: internal safety team testing plus at least one independent external red-teaming exercise per model generation before initial deployment.
Internal testing: The provider's own safety, alignment, or red-teaming team conducts structured capability evaluations across all five S-01 categories. Internal testing is iterative — it informs model modifications throughout development. Internal testers must have documented independence from the model development team: results cannot be filtered by product owners before safety team review.
External / independent evaluators: At least one external red-teaming exercise must be conducted by a third party that meets independence requirements. The CoP defines independence as: no financial dependency on the model development team, no prior engagement that could create conflict of interest, no access to the model's training data or architecture details beyond what is provided for the specific evaluation.
Third-party evaluator requirements:
- Must sign a testing scope agreement with documented methodology before starting
- Must have domain expertise in the capability categories being tested (CBRN evaluators need security/scientific background; cyberoffensive evaluators need offensive security expertise)
- Must produce a written evaluation report that includes raw results, methodology, and findings
- The provider cannot redact or modify the third-party report before submitting it in the pre-deployment gate documentation
The AI Office maintains a non-binding registry of evaluators that have completed CoP methodology training. Using a registry-listed evaluator is not required but creates a documentation trail that simplifies the presumption of conformity argument.
Automated evaluations (model vs. model attacks, benchmark scoring, automated prompt injection testing) may supplement but do not substitute for human red-teaming in the external evaluator requirement.
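Putting the S-02 requirements together, a quick self-check over a set of (category, evaluator type) pairs could look like this sketch. The category and evaluator labels are this sketch's own shorthand:

```python
S01_CATEGORIES = {"cbrn", "cyberoffensive", "critical_infrastructure",
                  "autonomous_goal_seeking", "large_scale_persuasion"}

def s02_evaluator_mix_ok(results: list[tuple[str, str]]) -> bool:
    """Illustrative check of the S-02 hybrid model.

    Requires internal testing across all five S-01 categories plus at
    least one external independent (human) exercise. Automated runs may
    supplement but count toward neither requirement.
    """
    internal = {cat for cat, evaluator in results if evaluator == "internal"}
    has_external = any(ev == "external_independent" for _, ev in results)
    return S01_CATEGORIES <= internal and has_external
```

Internal-only coverage fails the check, as does an automated run standing in for the external exercise.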
S-03: Pre-Deployment Gate Decision
Measure S-03 requires that the pre-deployment testing results be formally incorporated into the deployment decision. This is not a bureaucratic requirement — it defines the conditions under which a model may be deployed and what happens when test results indicate unacceptable risk levels.
The CoP requires a documented deployment gate review that includes:
- Test results for all five S-01 capability categories (internal and external)
- Risk assessment for each category: acceptable / acceptable with mitigations / unacceptable
- Mitigation measures implemented in response to findings
- Sign-off by a designated responsible officer (not a safety team member — this must be a named individual with decision authority over deployment)
If any capability category is assessed as unacceptable risk level and cannot be mitigated before the target deployment date, the deployment must be delayed. The CoP does not permit deploying with documented unacceptable risk and a plan to fix it post-deployment.
Risk categorization framework for S-03:
| Risk Level | Definition | Gate Decision |
|---|---|---|
| Acceptable | Testing confirms capability category presents negligible uplift or risk | Deploy permitted |
| Acceptable with mitigations | Testing identifies risk; mitigations implemented and retested | Deploy permitted after successful retest |
| Conditionally acceptable | Risk present but below deployment threshold; monitoring plan required | Deploy with enhanced monitoring commitment |
| Unacceptable | Testing confirms capability category presents meaningful uplift or risk beyond threshold | No deployment until risk is reduced |
The deployment gate document must be retained for the duration the model is deployed plus five years, and produced on request to the AI Office.
S-04: Retesting Triggers
Measure S-04 requires that the pre-deployment testing cycle be repeated when specified conditions occur post-initial deployment. The rationale: models that pass initial red-teaming may have capabilities change through fine-tuning, deployment in new contexts, or emergent behaviors discovered in production.
Mandatory retesting triggers under S-04:
- Material change to the model: Retraining on new data that changes the model's capability profile, fine-tuning that affects any of the five S-01 capability categories, architectural modifications, or changes to safety training/alignment procedures. "Material change" is assessed against the model's capability profile at the last deployment gate, not a fixed technical threshold.
- New high-risk deployment context: Deployment to a new downstream provider that enables agentic use, tool use, or integration with critical infrastructure systems not covered in the original gate documentation.
- Significant capabilities discovered post-deployment: If production use reveals a capability that was not present or not tested in the original deployment gate — for example, a model's ability to perform multi-step autonomous actions that was not evident in benchmark testing.
- Annual retest: Even without any of the above triggers, Chapter 3 requires at least one full S-01/S-02 cycle per calendar year for deployed Systemic Risk models.
Retesting does not always require a full external red-team exercise. If the trigger is a narrow change — for example, a fine-tuning run that modified only the model's tone and writing style — a targeted internal retest of the affected capability categories may satisfy S-04. The provider must document the rationale for why a limited retest was appropriate.
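The S-04 decision logic above, full cycle for broad triggers versus a documented targeted retest for narrow changes, can be sketched as follows. The trigger labels are invented for illustration:

```python
def s04_retest_plan(trigger: str, affected_categories: set[str]) -> dict:
    """Illustrative S-04 decision; trigger labels are this sketch's own.

    Broad triggers require the full S-01/S-02 cycle, including an external
    red-team exercise. A narrow change permits a targeted internal retest
    of the affected categories, with a documented rationale.
    """
    full_cycle_triggers = {
        "material_change",             # capability profile changed since last gate
        "new_high_risk_context",       # e.g. new agentic downstream deployment
        "post_deployment_capability",  # capability discovered in production
        "annual_cycle",                # at least one full cycle per calendar year
    }
    if trigger in full_cycle_triggers:
        return {"scope": "full_cycle", "external_red_team": True}
    return {
        "scope": "targeted_internal",
        "external_red_team": False,
        "categories": sorted(affected_categories),
        "rationale_documented": True,  # S-04 requires the rationale on record
    }
```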
Incident Reporting (S-05 to S-07)
S-05: Serious Incident Definition
Measure S-05 requires providers to maintain an internal incident classification process that identifies when an event constitutes a "serious incident" triggering reporting obligations under S-06 and S-07.
The CoP defines a serious incident as any of the following:
Death or serious physical harm to persons: Any incident where the model's output is determined to be a causal or contributing factor in physical injury or death. This includes cases where the model provided medical misinformation that led to harmful treatment decisions, operational guidance that contributed to physical accidents, or content that contributed to self-harm.
CBRN-relevant outputs discovered in production: Discovery that the model has provided meaningful uplift in any of the CBRN categories to a user or group of users, regardless of whether harm has occurred. The discovery itself — through monitoring, external report, or law enforcement notification — is a reportable event.
Large-scale psychological or societal harm: Documented use of the model's outputs in large-scale influence operations, coordinated manipulation campaigns, or disinformation events that caused measurable societal harm.
Critical infrastructure incident: Discovery that the model's outputs were used in or contributed to an attack on or disruption of critical infrastructure.
Model security breach: Unauthorized access to model weights, training data, or inference infrastructure by a party with potentially malicious intent.
The CoP specifies that the incident classification must be made by the provider's safety or incident response team, not by the product or business teams. This independence requirement prevents incidents from being reclassified to avoid reporting obligations.
S-06: 72-Hour Initial Notification
Within 72 hours of classifying an event as a serious incident, the provider must submit an initial notification to the AI Office through the GPAI Incident Reporting Portal. The 72-hour clock starts when the provider has sufficient information to classify the event as serious — not when the underlying event occurred.
The initial notification must contain:
| Field | Required Content |
|---|---|
| Incident identifier | Provider-assigned unique ID for tracking |
| Incident category | Which serious incident definition triggered reporting |
| Discovery date/time | When the provider became aware of the event |
| Classification date/time | When the event was classified as serious |
| Affected model(s) | Model identifiers, versions, and deployment context |
| Preliminary description | What is known about the event at notification time |
| Immediate containment actions | Steps taken since discovery to limit harm |
| Preliminary causal assessment | Best current hypothesis about cause (subject to change) |
| Known third-party notifications | Law enforcement, downstream providers, affected parties notified |
The 72-hour window is explicitly not conditioned on having complete information. Providers must submit the initial notification with whatever information is available and update it as the investigation progresses. An ongoing investigation is not a valid reason for missing the 72-hour deadline.
Downstream provider notification: If the incident involves a model deployed through downstream providers, the upstream GPAI provider must also notify affected downstream providers within 72 hours. The CoP does not specify a separate timeline for downstream notification — it is assumed to occur in parallel with the AI Office notification.
S-07: 15-Day Root Cause Report
Within 15 calendar days of the initial S-06 notification, the provider must submit a root cause analysis report to the AI Office. The 15-day clock starts from the date of the S-06 notification, not the incident discovery date.
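The two clocks compose as follows. The timestamps are invented for illustration; note that the S-06 clock runs from classification, not from the underlying event:

```python
from datetime import datetime, timedelta

classified = datetime(2026, 3, 2, 9, 30)         # event classified as serious
s06_deadline = classified + timedelta(hours=72)  # initial notification due (S-06)

notified = datetime(2026, 3, 4, 17, 0)           # actual S-06 submission
s07_deadline = notified + timedelta(days=15)     # root cause report due (S-07)

print(notified <= s06_deadline)  # True: S-06 met with time to spare
print(s07_deadline)              # runs from notification, not from discovery
```

An early S-06 submission therefore also pulls the S-07 deadline earlier, which is worth factoring into incident response planning.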
The root cause report must address:
What happened: A complete factual reconstruction of the incident from the earliest detectable precursor through the immediate cause of harm or risk. This includes model inputs and outputs, user interactions, the deployment context, and the causal chain between the model's behavior and the identified harm or risk.
Why it happened: The technical root cause(s). For capability-related incidents, this must include an assessment of whether the capability was present in deployment gate testing, why it was not detected, and what capability evaluation gap exists. For security incidents, this must include the attack vector, the exploitation mechanism, and how existing controls failed.
What was done immediately: The containment actions taken in the 72-hour window after discovery — model access restrictions, downstream provider notifications, affected user notifications, law enforcement engagement.
What will be done to prevent recurrence: Corrective actions with implementation timelines, updated deployment gate criteria, changes to the S-01 capability category scope, security control improvements. Each corrective action must identify the responsible individual and the completion date.
Evidence package: Relevant logs, monitoring data, model evaluation records, user interaction data (pseudonymized per GDPR requirements), and third-party reports. The AI Office may request additional evidence after receiving the root cause report.
The AI Office reviews the root cause report and may:
- Accept it as satisfying S-07 requirements
- Request additional information or analysis within a specified timeframe
- Initiate an investigation under Art.92 if the root cause report suggests systemic compliance failures
- Refer the incident to national market surveillance authorities if it falls within their jurisdiction
Cybersecurity Controls (S-08 to S-10)
S-08: Prompt Injection Protection
Measure S-08 requires GPAI providers to implement and maintain technical controls to protect against prompt injection attacks — attempts to override or subvert the model's instructions through adversarial inputs.
Input-level controls:
- Input sanitization that detects and flags structured injection patterns (instruction override attempts, role-playing frameworks designed to bypass safety measures, encoded or obfuscated instructions)
- Context window management that limits the ability of user-provided content to override system-level instructions
- Distinction enforcement between trusted system prompt content and untrusted user content
Output-level controls:
- Output monitoring that detects model responses consistent with a successful injection, particularly abandonment of previously honored safety guidelines, sudden context switches, or outputs inconsistent with the stated system purpose
- Logging of input-output pairs flagged by injection detection for subsequent review
For agentic deployments specifically: When the GPAI model operates with tool-calling capabilities or has access to external data sources, S-08 requires additional controls: validation of tool-call parameters before execution (preventing injection-driven tool misuse), sandboxing of tool execution environments, and explicit re-validation of the model's instruction state before high-consequence tool calls.
The CoP does not specify which technical implementation satisfies S-08 — it specifies the outcome requirement (reduction in successful injection rate to below a defined threshold). Providers document their implementation approach and the effectiveness metrics in the Chapter 3 compliance documentation.
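To make the input-level side concrete, a toy pattern flagger is sketched below. The patterns are invented examples; a production S-08 control would combine classifier models, context isolation, and the output-side monitoring described above rather than relying on regexes:

```python
import re

# Invented example patterns for common instruction-override phrasings.
INJECTION_PATTERNS = [
    r"ignore (all )?(previous|prior|above) instructions",
    r"you are no longer bound by",
    r"pretend (you are|to be) [^.]* without (any )?restrictions",
    r"disregard your system prompt",
]

def flag_for_review(user_input: str) -> bool:
    """Return True if the input matches a known injection pattern and
    should be logged for review per S-08 (flagging, not necessarily blocking)."""
    lowered = user_input.lower()
    return any(re.search(pattern, lowered) for pattern in INJECTION_PATTERNS)
```

Flagged input-output pairs would then feed the S-08 logging requirement and, where effectiveness metrics are tracked, the documented injection success rate.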
S-09: Model Weight Access Control
Measure S-09 requires that model weights — the trained parameters of the GPAI model — be protected against unauthorized access, exfiltration, or modification.
Physical security: Training compute and inference infrastructure must be hosted in facilities with documented physical access controls. CLOUD Act jurisdiction is specifically identified in the CoP as a risk factor: infrastructure hosted in US-jurisdiction data centers is subject to CLOUD Act orders that could provide US government access without EU legal process. CoP guidance recommends EU-sovereign infrastructure for model weight storage, though this is not a hard requirement.
Logical security:
- Access to model weights restricted to individuals with documented operational need
- Multi-person authorization for weight download or export operations
- Full audit logging of all weight access, download, and transfer operations
- Encryption at rest for stored weights and in transit during transfer
- Separation of training infrastructure (where weights are created) from inference infrastructure (where weights are deployed) with explicit transfer authorization process
Supply chain security:
- Verification of the integrity of model weights before deployment (hash verification against training-output checksum)
- Documentation of the weight provenance chain from training output to production deployment
- Detection of weight modification attempts through integrity monitoring
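The hash-verification step can be a simple streamed SHA-256 comparison against the checksum recorded at training output. A minimal sketch; a real deployment would also sign and protect the checksum itself:

```python
import hashlib
from pathlib import Path

def weights_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a weight artifact, streamed so large files are not
    loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

def verify_weights(path: Path, expected_digest: str) -> bool:
    """S-09 style integrity check against the training-output checksum."""
    return weights_digest(path) == expected_digest
```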
S-10: Anomaly Monitoring and Behavioral Drift Detection
Measure S-10 requires continuous monitoring of the model's production behavior to detect changes that may indicate a security incident, capability drift, or successful attack.
Behavioral baseline: The provider must establish a behavioral baseline derived from the deployment gate testing. The baseline documents expected output distributions across the five S-01 capability categories and normal response patterns for representative input types.
Monitoring requirements:
- Statistical sampling of production inputs and outputs at a rate sufficient to detect capability-level deviations with statistical significance
- Specific monitoring for outputs in or near the five S-01 capability categories
- Alert thresholds calibrated to the behavioral baseline, triggering internal review when output distribution shifts beyond specified bounds
- Integration between production monitoring and the S-05 incident classification process — monitoring alerts that cannot be explained by benign causes must be escalated to the incident classification process
Response to detected drift: When anomaly monitoring detects a significant deviation from the behavioral baseline, the response process is:
- Internal escalation to the safety team for review
- If the deviation is consistent with a capability-related serious incident category under S-05, initiate the incident classification process
- If the deviation is not immediately explicable, trigger a targeted S-04 retest of the affected capability categories
- Document the anomaly, the investigation, and the determination — regardless of whether it resulted in an S-05 classification
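One simple way to operationalize the baseline comparison is a per-category frequency check against the deployment-gate distribution. The tolerance value below is an arbitrary placeholder, not a CoP figure:

```python
def drift_alerts(
    baseline: dict[str, float],       # expected output fraction per category
    observed_counts: dict[str, int],  # sampled production counts per category
    tolerance: float = 0.02,          # placeholder alert bound, not a CoP value
) -> list[str]:
    """Categories whose observed output rate deviates from the behavioral
    baseline by more than the tolerance (absolute difference).

    A real S-10 implementation would use a sample-size-aware statistical
    test; this sketch shows only the control flow into escalation.
    """
    total = sum(observed_counts.values())
    alerts = []
    for category, expected_rate in baseline.items():
        observed_rate = observed_counts.get(category, 0) / total if total else 0.0
        if abs(observed_rate - expected_rate) > tolerance:
            alerts.append(category)
    return alerts
```

Any returned category would enter the escalation path above: safety team review, then S-05 classification or a targeted S-04 retest.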
Chapter 3 vs Art.53: Statutory Mapping
The GPAI CoP Chapter 3 measures are designed to satisfy the Art.53 statutory obligations for Systemic Risk GPAI models. Understanding the mapping is important for providers who need to demonstrate compliance through the equivalence pathway (non-CoP-signatory) or who need to respond to AI Office investigations.
| Art.53 Obligation | CoP Measures | Notes |
|---|---|---|
| Art.53(1)(a) — Model evaluation + systemic risk mitigation | S-01, S-02, S-03 | Red-teaming = the primary implementation of Art.53(1)(a) |
| Art.53(1)(a) — Adversarial testing | S-01, S-02 | Third-party evaluators = the CoP's implementation of "adequate testing" |
| Art.53(1)(b) — Incident notification to AI Office | S-05, S-06, S-07 | The 72h/15-day timeline is CoP-specific, not mandated by Art.53 text |
| Art.53(1)(c) — Cybersecurity | S-08, S-09, S-10 | CoP gives specific technical form to the statutory cybersecurity obligation |
| Art.53(1)(d) — Energy efficiency baseline | Not in Chapter 3 | Energy reporting is Chapter 1 (Transparency), not Chapter 3 |
Presumption of conformity: A CoP signatory that has implemented and documented all ten measures, S-01 through S-10, benefits from the Art.56(8) presumption of conformity with the Art.53 obligations. The burden of proof shifts accordingly: the AI Office must prove a violation rather than the provider having to prove compliance.
Non-signatory equivalence: A provider not signed to the CoP must demonstrate that their alternative measures achieve equivalent protection for each Art.53(1) obligation. The AI Office's published guidance indicates that the Chapter 3 measures set the reference level for equivalence assessment — a non-signatory would need to show why their approach achieves the same or better protection than S-01 through S-10.
Python Tooling: Systemic Risk Chapter 3 Compliance Tracker
from dataclasses import dataclass, field
from datetime import date, datetime, timedelta
from enum import Enum
from typing import Optional
class CapabilityCategory(Enum):
CBRN_UPLIFT = "cbrn_uplift"
CYBEROFFENSIVE = "cyberoffensive"
CRITICAL_INFRASTRUCTURE = "critical_infrastructure"
AUTONOMOUS_GOAL_SEEKING = "autonomous_goal_seeking"
LARGE_SCALE_PERSUASION = "large_scale_persuasion"
class RiskLevel(Enum):
ACCEPTABLE = "acceptable"
ACCEPTABLE_WITH_MITIGATIONS = "acceptable_with_mitigations"
CONDITIONALLY_ACCEPTABLE = "conditionally_acceptable"
UNACCEPTABLE = "unacceptable"
class EvaluatorType(Enum):
INTERNAL = "internal"
EXTERNAL_INDEPENDENT = "external_independent"
AUTOMATED = "automated"
class IncidentCategory(Enum):
DEATH_OR_SERIOUS_HARM = "death_or_serious_harm"
CBRN_RELEVANT_OUTPUT = "cbrn_relevant_output"
LARGE_SCALE_HARM = "large_scale_harm"
CRITICAL_INFRASTRUCTURE = "critical_infrastructure"
MODEL_SECURITY_BREACH = "model_security_breach"
@dataclass
class CapabilityTestResult:
category: CapabilityCategory
evaluator_type: EvaluatorType
test_date: date
risk_level: RiskLevel
methodology_documented: bool
third_party_report_received: bool # Required if evaluator_type == EXTERNAL_INDEPENDENT
mitigations_implemented: list[str] = field(default_factory=list)
retest_required: bool = False
retest_passed: Optional[bool] = None
@dataclass
class PreDeploymentGate:
gate_date: date
model_version: str
responsible_officer: str
test_results: list[CapabilityTestResult] = field(default_factory=list)
deployment_approved: bool = False
approval_conditions: list[str] = field(default_factory=list)
def can_deploy(self) -> bool:
"""Returns True only if all five categories tested and none unacceptable."""
if len(self.test_results) < 5:
return False
categories_tested = {r.category for r in self.test_results}
all_categories = set(CapabilityCategory)
if not all_categories.issubset(categories_tested):
return False
for result in self.test_results:
if result.risk_level == RiskLevel.UNACCEPTABLE:
return False
if result.retest_required and result.retest_passed is not True:
return False
return True
def missing_categories(self) -> list[CapabilityCategory]:
tested = {r.category for r in self.test_results}
return [c for c in CapabilityCategory if c not in tested]
def has_external_evaluator(self) -> bool:
return any(
r.evaluator_type == EvaluatorType.EXTERNAL_INDEPENDENT
for r in self.test_results
)
@dataclass
class IncidentReport:
incident_id: str
category: IncidentCategory
discovery_datetime: datetime
classification_datetime: datetime
affected_model_version: str
preliminary_description: str
containment_actions: list[str] = field(default_factory=list)
ai_office_notified_datetime: Optional[datetime] = None
root_cause_report_submitted_datetime: Optional[datetime] = None
root_cause_summary: Optional[str] = None
def notification_deadline(self) -> datetime:
"""72-hour notification deadline from classification."""
return self.classification_datetime + timedelta(hours=72)
def root_cause_deadline(self) -> datetime:
"""15-day root cause report deadline from notification."""
if self.ai_office_notified_datetime:
return self.ai_office_notified_datetime + timedelta(days=15)
return self.notification_deadline() + timedelta(days=15)
def notification_overdue(self) -> bool:
if self.ai_office_notified_datetime:
return False
return datetime.utcnow() > self.notification_deadline()
def root_cause_overdue(self) -> bool:
if self.root_cause_report_submitted_datetime:
return False
return datetime.utcnow() > self.root_cause_deadline()
def compliance_status(self) -> dict:
return {
"notification_met": self.ai_office_notified_datetime is not None
and self.ai_office_notified_datetime <= self.notification_deadline(),
"root_cause_met": self.root_cause_report_submitted_datetime is not None
and self.root_cause_report_submitted_datetime <= self.root_cause_deadline(),
"notification_overdue": self.notification_overdue(),
"root_cause_overdue": self.root_cause_overdue(),
}
@dataclass
class CybersecurityControls:
prompt_injection_controls_documented: bool = False
prompt_injection_effectiveness_metric: Optional[float] = None # % injection attempts blocked
model_weight_access_log_active: bool = False
model_weight_encryption_at_rest: bool = False
model_weight_multi_person_auth: bool = False
eu_sovereign_infrastructure: bool = False # CoP recommendation
anomaly_monitoring_active: bool = False
behavioral_baseline_established: bool = False
anomaly_alert_threshold_defined: bool = False
last_monitoring_review: Optional[date] = None
def s08_compliant(self) -> bool:
return self.prompt_injection_controls_documented
def s09_compliant(self) -> bool:
return (
self.model_weight_access_log_active
and self.model_weight_encryption_at_rest
and self.model_weight_multi_person_auth
)
def s10_compliant(self) -> bool:
return (
self.anomaly_monitoring_active
and self.behavioral_baseline_established
and self.anomaly_alert_threshold_defined
)
def gaps(self) -> list[str]:
result = []
if not self.s08_compliant():
result.append("S-08: Prompt injection controls not documented")
if not self.s09_compliant():
if not self.model_weight_access_log_active:
result.append("S-09: Model weight access logging not active")
if not self.model_weight_encryption_at_rest:
result.append("S-09: Model weights not encrypted at rest")
if not self.model_weight_multi_person_auth:
result.append("S-09: Multi-person auth for weight export not implemented")
if not self.s10_compliant():
if not self.behavioral_baseline_established:
result.append("S-10: Behavioral baseline not established")
if not self.anomaly_monitoring_active:
result.append("S-10: Anomaly monitoring not active")
if not self.anomaly_alert_threshold_defined:
result.append("S-10: Alert thresholds not defined")
return result
@dataclass
class SystemicRiskCh3Tracker:
    model_name: str
    model_version: str
    systemic_risk_basis: str  # "art51_1a_threshold" or "art51_1b_designation"
    deployment_date: Optional[date]
    latest_gate: Optional[PreDeploymentGate] = None
    incidents: list[IncidentReport] = field(default_factory=list)
    cybersecurity: CybersecurityControls = field(default_factory=CybersecurityControls)
    last_annual_retest: Optional[date] = None

    ENFORCEMENT_DATE = date(2026, 8, 2)

    def days_to_enforcement(self) -> int:
        return (self.ENFORCEMENT_DATE - date.today()).days

    def annual_retest_overdue(self) -> bool:
        if not self.last_annual_retest:
            return True
        return (date.today() - self.last_annual_retest).days > 365

    def chapter3_gaps(self) -> list[str]:
        gaps = []
        # Pre-deployment gate
        if not self.latest_gate:
            gaps.append("S-01/S-02/S-03: No deployment gate on record")
        else:
            if not self.latest_gate.can_deploy():
                for cat in self.latest_gate.missing_categories():
                    gaps.append(f"S-01: Missing test for {cat.value}")
            if not self.latest_gate.has_external_evaluator():
                gaps.append("S-02: No external independent evaluator engaged")
        # Annual retest
        if self.annual_retest_overdue():
            gaps.append("S-04: Annual retest overdue")
        # Open incidents
        for incident in self.incidents:
            if incident.notification_overdue():
                gaps.append(f"S-06: Incident {incident.incident_id} — 72h notification overdue")
            if incident.root_cause_overdue():
                gaps.append(f"S-07: Incident {incident.incident_id} — 15-day root cause report overdue")
        # Cybersecurity
        gaps.extend(self.cybersecurity.gaps())
        return gaps
    def overall_ch3_readiness(self) -> float:
        """Returns 0.0 to 1.0 — fraction of Chapter 3 requirements met."""
        total = 10  # S-01 through S-10
        met = 0
        if self.latest_gate:
            if not self.latest_gate.missing_categories():
                met += 1  # S-01
            if self.latest_gate.has_external_evaluator():
                met += 1  # S-02
            if self.latest_gate.can_deploy():
                met += 1  # S-03
        if not self.annual_retest_overdue():
            met += 1  # S-04
        # S-05: incident classification process — credited on the assumption that
        # operating this tracker implies a documented classification process
        met += 1
        overdue_notifications = [i for i in self.incidents if i.notification_overdue()]
        if not overdue_notifications:
            met += 1  # S-06: no incident has missed the 72h notification deadline
        closed_incidents = [
            i for i in self.incidents
            if i.root_cause_report_submitted_datetime is not None
        ]
        if len(closed_incidents) == len(self.incidents):
            met += 1  # S-07: every incident has a root cause report on file
        if self.cybersecurity.s08_compliant():
            met += 1  # S-08
        if self.cybersecurity.s09_compliant():
            met += 1  # S-09
        if self.cybersecurity.s10_compliant():
            met += 1  # S-10
        return met / total

    def generate_ch3_report(self) -> str:
        gaps = self.chapter3_gaps()
        readiness = self.overall_ch3_readiness()
        report = [
            f"Chapter 3 Compliance Report — {self.model_name} {self.model_version}",
            f"Systemic Risk Basis: {self.systemic_risk_basis}",
            f"Enforcement: {self.days_to_enforcement()} days to August 2, 2026",
            f"Overall Readiness: {readiness:.0%}",
            "",
            "GAPS:" if gaps else "No gaps identified.",
        ]
        for gap in gaps:
            report.append(f"  ✗ {gap}")
        return "\n".join(report)
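The 72-hour and 15-day clocks referenced in the gap messages reduce to simple datetime arithmetic. A minimal self-contained sketch (the helper names are illustrative, not methods of the tracker above):

```python
from datetime import datetime, timedelta

# Illustrative helpers: the S-06 notification is due 72 hours after discovery,
# and the S-07 root cause report 15 days after discovery.
def notification_deadline(discovered: datetime) -> datetime:
    return discovered + timedelta(hours=72)

def root_cause_deadline(discovered: datetime) -> datetime:
    return discovered + timedelta(days=15)

discovered = datetime(2026, 3, 1, 9, 0)
print(notification_deadline(discovered))  # 2026-03-04 09:00:00
print(root_cause_deadline(discovered))    # 2026-03-16 09:00:00
```

Note that both clocks start at discovery, not at classification — an escalation process that takes days to classify an incident consumes the reporting window.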
Chapter 3 Readiness Checklist (21 Items)
Pre-Deployment Adversarial Testing (S-01 to S-04):
- S-01.1 Testing scope document covers all five capability categories (CBRN, Cyberoffensive, Critical Infrastructure, Autonomous Goal-Seeking, Large-Scale Persuasion)
- S-01.2 Each capability category has documented threat model, methodology, and pass/fail criteria
- S-01.3 Pass/fail criteria calibrated to meaningful uplift threshold, not just capability presence
- S-02.1 Internal safety team has documented independence from product/model development team
- S-02.2 External independent evaluator engaged with signed methodology agreement
- S-02.3 External evaluator meets domain expertise requirements for tested categories
- S-02.4 External evaluator written report received, unredacted, incorporated in gate documentation
- S-03.1 Pre-deployment gate review conducted with results from all five categories
- S-03.2 Gate sign-off by named responsible officer with deployment authority
- S-03.3 No capability category assessed as Unacceptable without delay of deployment
- S-04.1 Material change triggers for retesting documented and communicated to model team
- S-04.2 Annual retest scheduled and not overdue from last gate
Incident Reporting (S-05 to S-07):
- S-05.1 Serious incident definitions documented and distributed to safety and operations teams
- S-05.2 Incident classification authority assigned to safety team (not product/business teams)
- S-06.1 AI Office GPAI Incident Reporting Portal account established and tested
- S-06.2 Internal escalation process ensures a classified incident reaches the reporting-responsible individual within hours of discovery, preserving the 72-hour notification window
- S-07.1 Root cause report template prepared with all required fields pre-populated
- S-07.2 Responsible individual for root cause report identified and trained
Cybersecurity (S-08 to S-10):
- S-08.1 Prompt injection controls documented with effectiveness metric defined
- S-09.1 Model weight access log active, multi-person auth for export implemented, encryption at rest verified
- S-10.1 Behavioral baseline from deployment gate testing documented; anomaly monitoring active with defined alert thresholds
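The five S-01 capability categories recur throughout the checklist, so it is worth encoding them once and deriving gate checks from the single definition. A hedged sketch (the Enum and helper are illustrative, not from the CoP text or the tracker above):

```python
from enum import Enum

# The five S-01 capability categories (member names illustrative)
class CapabilityCategory(Enum):
    CBRN = "CBRN"
    CYBEROFFENSIVE = "Cyberoffensive"
    CRITICAL_INFRASTRUCTURE = "Critical Infrastructure"
    AUTONOMOUS_GOAL_SEEKING = "Autonomous Goal-Seeking"
    LARGE_SCALE_PERSUASION = "Large-Scale Persuasion"

def missing_categories(tested: set) -> list:
    """Categories with no documented test result — each one blocks the S-03 gate."""
    return [c for c in CapabilityCategory if c not in tested]

tested = {CapabilityCategory.CBRN, CapabilityCategory.CYBEROFFENSIVE}
print([c.value for c in missing_categories(tested)])
# ['Critical Infrastructure', 'Autonomous Goal-Seeking', 'Large-Scale Persuasion']
```

Driving the S-01.1 scope document and the S-03.1 gate review from the same enumeration keeps the two from drifting apart when categories are clarified in future CoP updates.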
Timeline and Enforcement
Chapter 3 obligations became applicable for Systemic Risk providers on August 2, 2025, the date the AI Act's Chapter V GPAI obligations became applicable. AI Office enforcement of GPAI obligations, including Chapter 3, begins on August 2, 2026. This means:
- Providers who crossed the 10^25 FLOPs threshold before August 2025 should already have Chapter 3 measures in place
- Providers crossing the threshold between August 2025 and enforcement date (August 2, 2026) have until enforcement date to achieve compliance
- After August 2, 2026, the AI Office can impose fines of up to 3% of global annual turnover or EUR 15 million, whichever is higher, for violations of GPAI obligations (Art.101(1))
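The timeline rules above collapse into a single deadline function. A minimal sketch (the date constants come from the text; the function name is illustrative):

```python
from datetime import date

APPLICABILITY_DATE = date(2025, 8, 2)  # Chapter V obligations become applicable
ENFORCEMENT_DATE = date(2026, 8, 2)    # AI Office enforcement begins

def ch3_compliance_deadline(threshold_crossed: date) -> date:
    """Date by which Chapter 3 measures must be in place, per the rules above."""
    if threshold_crossed < APPLICABILITY_DATE:
        # Crossed 10^25 FLOPs before August 2025: should already be compliant
        return APPLICABILITY_DATE
    # Crossed between applicability and enforcement: compliance due at enforcement
    return ENFORCEMENT_DATE

print(ch3_compliance_deadline(date(2025, 1, 15)))  # 2025-08-02
print(ch3_compliance_deadline(date(2026, 1, 15)))  # 2026-08-02
```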
Interaction with the GPAI CoP Final — What Changed in July 2025:
The AI Office adopted the final CoP in July 2025. For providers already in the CoP development process (2023-2025), the final CoP text clarified several Chapter 3 specifics that were not defined in earlier drafts:
- The five specific capability categories in S-01 (previously described as "dangerous capabilities" without enumeration)
- The independence requirements for external evaluators in S-02 (previously general)
- The specific 15-day timeline for root cause reports in S-07 (previously "timely")
- The explicit CLOUD Act risk guidance in S-09 (added in final text)
Providers who built their Chapter 3 programs against earlier draft versions should review these four areas against the final July 2025 text.
See Also
- GPAI Code of Practice Final: Implementation Guide for AI Developers (2026) — overview of all three CoP chapters; Chapter 3 deep-dive assumes familiarity with the overall CoP structure
- EU AI Act Art.53: GPAI Models with Systemic Risk — Adversarial Testing, Incident Reporting & Cybersecurity Developer Guide — Art.53 is the statutory basis for Chapter 3; the CoP measures (S-01 to S-10) implement Art.53 obligations
- EU AI Act Art.51: GPAI Model Classification — Systemic Risk Threshold and Provider Obligations Developer Guide — Art.51 defines the 10^25 FLOPs threshold and AI Office designation procedure that determine Chapter 3 applicability
- EU AI Act Art.73: Serious Incident Reporting for High-Risk AI — Developer Guide — Art.73 incident reporting for high-risk AI systems runs parallel to the GPAI CoP S-05/S-06/S-07 incident workflow
- EU AI Act Art.75: Market Surveillance of GPAI Models — Developer Guide — Art.75 governs AI Office market surveillance of GPAI providers; Chapter 3 red-teaming and incident records are primary evidence in Art.75 investigations