2026-06-10·5 min read·sota.io Team

EU AI Act Art.9 Testing Requirements: What High-Risk AI Developers Must Test Before August 2026

Post #3 in the EU AI Act Art.9 Risk Management System 2026 Series


Art.9 of the EU AI Act does not merely mention testing. It makes testing a structural obligation within the Risk Management System — one that must be planned before development, executed throughout the lifecycle, and documented to a standard that satisfies a conformity assessor. Teams that treat testing as a last step before release will find themselves unable to demonstrate compliance.

The previous posts in this series covered the foundations of an Art.9-compliant RMS and the risk identification methodology you need before testing can begin. This post covers the third stage: turning identified risks into validated test coverage that produces the evidence Art.9 requires. Post 4 will cover continuous monitoring integration, and Post 5 will address the conformity assessment documentation package.

What Art.9 Actually Requires for Testing

Article 9 requires that the Risk Management System include testing as a specific component — not as an afterthought, but as one of the mechanisms by which residual risks are identified, measured, and addressed. The testing obligation sits alongside the risk identification and risk management measure obligations, forming a cycle: identify → test → measure → mitigate → retest.

The regulation is explicit that testing must occur at the stage of development, not only before market release. This has concrete implications for how development teams structure their sprints, their model evaluation pipelines, and their CI/CD tooling. Testing is not a gate at the end of development — it is a continuous activity throughout.

Testing must be performed against defined metrics and probabilistic thresholds that are appropriate to the intended purpose. This rules out generic benchmark results. If your system is an Annex III credit scoring tool, a test that demonstrates 92% accuracy on a general financial dataset is not sufficient — you need metrics that speak to accuracy across protected characteristic groups, across the specific customer segments the system will serve, and across the operational envelope defined in your intended purpose documentation.

Testing must also address the risks identified in the previous phase. There is a direct traceability requirement: each material risk in your risk register must be addressed by at least one test that either validates that the risk is acceptably mitigated or demonstrates that it is not. Risks with no corresponding test are gaps that a conformity assessor will find.
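The traceability check described above can be automated. The following is a minimal sketch, assuming a risk register and a test plan that can both be exported as simple records; the class names, field names, and IDs (`Risk`, `CoverageEntry`, `R-001`, `T-010`) are hypothetical and not prescribed by the regulation.

```python
from dataclasses import dataclass, field


@dataclass
class Risk:
    risk_id: str
    description: str
    material: bool = True


@dataclass
class CoverageEntry:
    test_id: str
    covers: set[str] = field(default_factory=set)  # risk IDs this test addresses


def uncovered_risks(risks: list[Risk], tests: list[CoverageEntry]) -> list[str]:
    """Return IDs of material risks that no test claims to cover."""
    covered = set()
    for test in tests:
        covered |= test.covers
    return [r.risk_id for r in risks if r.material and r.risk_id not in covered]


# Hypothetical register and test plan for a credit scoring system.
risks = [
    Risk("R-001", "Lower accuracy for thin-file applicants"),
    Risk("R-002", "Missing income field handled silently"),
    Risk("R-003", "Cosmetic UI inconsistency", material=False),
]
tests = [CoverageEntry("T-010", covers={"R-001"})]

gaps = uncovered_risks(risks, tests)
print(gaps)  # ['R-002']
```

Running a check like this in CI makes an uncovered material risk a build failure rather than something discovered during assessment.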

The Three Testing Phases Under Art.9

Art.9-compliant testing is not a single phase. It operates at three distinct points in the development lifecycle, each with different objectives and documentation requirements.

Phase 1 — Development-stage testing occurs during model training, architecture design, and feature engineering. Its purpose is early detection of risk materialisation — catching bias introduced during training, identifying distributional mismatches between training and deployment data, and validating that the model architecture is capable of meeting the performance thresholds you will commit to in the intended purpose documentation.

Development-stage testing does not need to use production data or production-scale infrastructure. It does need to be systematic, documented, and linked to the risk register. An experiment log showing that you evaluated three different bias mitigation strategies and selected one based on measured reduction in disparate impact across protected characteristic groups is exactly the kind of evidence Phase 1 should produce.

Phase 2 — Pre-deployment integration testing occurs before the system is placed on the market or put into service. This is the most intensive testing phase. Its purpose is to validate that the complete system — not just the model, but the model integrated with its input processing pipeline, output safeguards, human oversight interface, and logging infrastructure — performs as required under the conditions described in the intended purpose.

Phase 2 testing must include worst-case scenario analysis. The regulation requires you to test not only typical operating conditions but also foreseeable edge cases, including cases arising from the reasonably foreseeable misuse identified during the risk identification stage of the RMS. A credit scoring model tested only on clean, complete credit applications is not compliant if your intended purpose documentation acknowledges that applications will sometimes arrive with missing fields, with data from operators who use different data standards, or from consumers who have never had credit before.

Phase 3 — Post-modification validation testing is triggered whenever a substantial modification occurs. The regulation defines substantial modification precisely enough that you need a documented process for evaluating whether a change crosses the threshold. If it does, Phase 2-equivalent testing must be repeated for the affected dimensions. If it does not, you still need a documented record of the assessment.

The practical implication is that your testing infrastructure cannot be a one-time setup. It needs to be versioned, reproducible, and connected to your change management process so that substantial modifications automatically trigger the appropriate test campaigns.

Defining Metrics and Probabilistic Thresholds

The requirement to test against "defined metrics and probabilistic thresholds appropriate to the intended purpose" is one of the most actionable — and most underimplemented — testing obligations in Art.9.

Metrics must be purpose-specific. A general classification accuracy metric is rarely sufficient for Annex III use cases. The appropriate metrics depend on the domain:

For credit and insurance scoring systems (Annex III, points 5(b) and 5(c)), the relevant metrics include calibration across income deciles, demographic parity or equalised odds across protected characteristics, and the false negative rate for applicants who should qualify. A system that is 94% accurate overall but 78% accurate for applicants from one demographic group is not demonstrating compliance; it is documenting a discrimination risk.

For employment and worker management systems (Annex III, point 4), the relevant metrics include selection rate parity across protected characteristics (national origin, sex, age), predictive validity across subgroups (do the system's predictions correlate with job performance equally across groups?), and consistency of outcomes across geographies if the system is deployed in multiple jurisdictions.

For safety components in critical infrastructure (Annex III, point 2), the relevant metrics include false negative rate at operational thresholds, performance degradation under sensor noise or communication latency, and system behaviour during graceful degradation when sensor inputs are partially unavailable.

Thresholds must be committed in advance. One of the most common compliance failures is defining thresholds after testing rather than before. When you set a threshold of "demographic parity ratio >= 0.8" before running the evaluation, a result of 0.75 is a finding that requires remediation. When you evaluate results and then decide that 0.75 is acceptable, you have produced a metric without a threshold — which does not satisfy Art.9.

Thresholds must be documented before testing begins, must be justified by reference to the intended purpose and deployment context, and must be maintained in version control alongside the model version and test dataset version they apply to.
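One way to make "committed in advance" verifiable is to store the thresholds as data, fingerprint them, and record the fingerprint in the test plan before any evaluation runs. The sketch below is illustrative only: the metric names, the 0.10 false negative limit, and the helper names `spec_fingerprint` and `evaluate` are assumptions, not part of any standard.

```python
import hashlib
import json
import operator

# Hypothetical threshold specification, written and committed before evaluation.
THRESHOLDS = {
    "demographic_parity_ratio": {"op": ">=", "value": 0.8},
    "false_negative_rate": {"op": "<=", "value": 0.10},
}

OPS = {">=": operator.ge, "<=": operator.le}


def spec_fingerprint(spec: dict) -> str:
    """Stable SHA-256 of the threshold spec, suitable for recording in the test plan."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def evaluate(results: dict, spec: dict) -> dict:
    """Compare measured metrics with the pre-committed thresholds."""
    findings = {}
    for name, rule in spec.items():
        if name not in results:
            findings[name] = "missing"  # a metric that was never measured is also a gap
        elif OPS[rule["op"]](results[name], rule["value"]):
            findings[name] = "pass"
        else:
            findings[name] = "fail"
    return findings


measured = {"demographic_parity_ratio": 0.75, "false_negative_rate": 0.08}
findings = evaluate(measured, THRESHOLDS)
print(findings)
```

With the thresholds fixed first, the 0.75 parity ratio is reported as a failure that needs remediation; there is no step at which it can quietly be reclassified as acceptable.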

Probabilistic thresholds are required for probabilistic systems. If your system produces probability scores rather than hard classifications, your thresholds must operate in probability space. A medical imaging AI that produces a cancer probability score cannot simply be evaluated by accuracy at a fixed threshold — it must be evaluated by ROC curve analysis, AUC, calibration curves, and sensitivity/specificity at clinically meaningful operating points. The choice of operating point must be justified against the clinical context and must be documented as part of the RMS.
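To illustrate evaluation in probability space, here is a small dependency-free sketch that computes AUC and the sensitivity/specificity pair at a chosen operating point. The labels and scores are made-up toy data, and the functions assume both classes are present; in practice an established library would normally be used instead.

```python
def sensitivity_specificity(y_true, scores, threshold):
    """Sensitivity and specificity when scores >= threshold are treated as positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win. Quadratic in sample size, so only for small sets.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: 1 = condition present, scores are model probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.35, 0.4, 0.2, 0.1, 0.6, 0.7]

print(auc(y_true, scores))                           # 0.875
print(sensitivity_specificity(y_true, scores, 0.5))  # (0.75, 0.75)
print(sensitivity_specificity(y_true, scores, 0.3))  # (1.0, 0.5)
```

Lowering the operating point from 0.5 to 0.3 raises sensitivity to 1.0 at the cost of specificity, which is exactly the trade-off the RMS has to justify against the clinical context.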

Test Dataset Requirements

The datasets used for testing are not a private engineering decision; they are a compliance obligation. Art.10 (data and data governance), read together with Art.9, requires that test datasets be relevant, sufficiently representative and, to the best extent possible, free of errors and complete in view of the intended purpose.

Representativeness is non-negotiable. The test dataset must be representative of the conditions in which the system will operate. For a system intended to be deployed across EU member states, a test dataset drawn exclusively from German users does not demonstrate compliance for use in Spain or Hungary. For a system intended to process documents in multiple languages, a test dataset in English only is not representative of the operational envelope.

Representativeness must be documented. The documentation must describe what characteristics the test dataset was designed to represent, how you verified that the distribution of those characteristics in the dataset matches the expected distribution in deployment, and what gaps exist between the test dataset and the operational reality (every gap is a potential risk that requires separate treatment in the RMS).

Dataset splits must be rigorous. The distinction between training data, validation data, and test data is fundamental. The test dataset must not have influenced any model parameter or hyperparameter decision. If any member of the test dataset was used in cross-validation during training, hyperparameter search, or early stopping, it is contaminated and cannot serve as a final pre-deployment evaluation.

The regulation does not prescribe specific dataset split ratios, but conformity assessors will look for evidence that the test set is genuinely independent. Standard practice — 70/15/15 or 80/10/10 train/validation/test splits — is generally defensible, but documenting the split methodology and verifying independence is your obligation.
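Independence between splits can be checked mechanically. The sketch below hashes a set of identifying fields per record and counts shared keys between every pair of splits. The field name `applicant_id` and the helper names are hypothetical, and exact-key matching will not catch near-duplicates, which need a separate check.

```python
import hashlib


def record_key(record: dict, fields: tuple[str, ...]) -> str:
    """Hash the identifying fields so overlap can be checked without comparing raw data."""
    raw = "|".join(str(record[f]) for f in fields)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def split_overlap(splits: dict[str, list[dict]], fields: tuple[str, ...]) -> dict:
    """Return the number of shared record keys for every pair of splits."""
    keys = {name: {record_key(r, fields) for r in rows} for name, rows in splits.items()}
    names = sorted(keys)
    return {
        (a, b): len(keys[a] & keys[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Toy splits in which applicant 3 has leaked from training into the test set.
splits = {
    "train": [{"applicant_id": 1}, {"applicant_id": 2}, {"applicant_id": 3}],
    "validation": [{"applicant_id": 4}],
    "test": [{"applicant_id": 5}, {"applicant_id": 3}],
}

overlap = split_overlap(splits, fields=("applicant_id",))
contaminated = {pair: n for pair, n in overlap.items() if n > 0}
print(contaminated)  # {('test', 'train'): 1}
```

Storing the output of a check like this next to the split methodology gives the assessor direct evidence that independence was verified, not just asserted.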

Dataset versioning is a compliance requirement. If your test dataset changes between evaluations, a result from two months ago is not evidence of compliance today. The dataset must be versioned. Each evaluation report must identify the dataset version it used. When the dataset is updated (to add new demographics, new operational scenarios, or corrected labels), re-evaluation against the new dataset version is required and the previous results become historical records rather than current compliance evidence.
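A simple way to make dataset versions unambiguous is to derive the identifier from the content itself. This is a minimal sketch, assuming the dataset fits in memory as a list of JSON-serialisable rows; the function name `dataset_version` and the 12-character truncation are illustrative choices, and dedicated data versioning tools would serve the same purpose at scale.

```python
import hashlib
import json


def dataset_version(rows: list[dict]) -> str:
    """Derive a short version ID from the dataset content, independent of row order."""
    row_hashes = sorted(
        hashlib.sha256(json.dumps(r, sort_keys=True).encode("utf-8")).hexdigest()
        for r in rows
    )
    digest = hashlib.sha256("\n".join(row_hashes).encode("utf-8")).hexdigest()
    return digest[:12]


v1 = dataset_version([{"id": 1, "label": 0}, {"id": 2, "label": 1}])
v1_reordered = dataset_version([{"id": 2, "label": 1}, {"id": 1, "label": 0}])
v2 = dataset_version([{"id": 1, "label": 0}, {"id": 2, "label": 0}])  # one corrected label

print(v1 == v1_reordered)  # True: same content, same version
print(v1 == v2)            # False: a corrected label produces a new version
```

Because a single corrected label changes the identifier, an evaluation report that cites the old identifier is visibly a historical record.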

Bias and Fairness Testing

Bias testing occupies a specific position in Art.9 compliance because the risks being tested are among the most legally and reputationally significant. The EU AI Act's fundamental rights provisions mean that a high-risk AI system with documented, unaddressed bias will not achieve a valid conformity assessment.

Define the protected characteristics relevant to your deployment context. The EU Charter of Fundamental Rights (Art.21) prohibits discrimination on grounds of sex, racial or ethnic origin, religion or belief, disability, age, and sexual orientation, among others. For any Annex III system that makes or materially influences decisions about individuals, you must define which of these characteristics are relevant to your context and test explicitly for disparate impact across them.

For many systems, direct protected characteristic data will not be available in the training or test dataset, and using such data in automated decision-making is itself legally constrained by GDPR. A common approach is to use proxy variables (surname-based ethnicity estimation, postcode-based demographic correlation, or statistical testing against external demographic datasets) to test for disparate impact even without direct characteristic data.

Select a fairness metric that matches your use case. There is no single correct fairness metric, and the choice of metric is itself a compliance-relevant decision that must be documented.

Demographic parity (equal selection rates across groups) is appropriate when there is no legitimate reason for selection rates to differ and when historical data that would justify differential rates is suspected to be discriminatory itself. It is the right metric for hiring tools where the underlying job performance predictor is independent of demographic group.

Equalised odds (equal true positive and false positive rates across groups) is appropriate when the correct prediction differs by group but the cost of errors should not. It is the right metric for recidivism prediction tools where false positives (wrongly predicting recidivism) have differential harm across demographic groups.

Calibration parity (equal probability calibration across groups) is appropriate for systems that produce scores used in downstream decisions. A credit scoring model that is miscalibrated for minority borrowers, systematically overestimating or underestimating their default risk, causes harm even if its overall accuracy is the same across groups.

Document the metric you chose, document why you chose it, and document the threshold at which you consider the metric acceptable. Then run the test and document the results.
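As a rough illustration of how two of these metrics are computed, the sketch below derives per-group selection rates, true positive rates, and false positive rates from binary predictions, then reports the demographic parity ratio and the equalised odds gaps (using the common definition in terms of true positive and false positive rates). The group labels and data are toy values, and the code assumes every group contains both positive and negative cases.

```python
from collections import defaultdict


def group_rates(y_true, y_pred, groups):
    """Per-group selection rate, true positive rate and false positive rate."""
    counts = defaultdict(lambda: {"n": 0, "sel": 0, "pos": 0, "tp": 0, "neg": 0, "fp": 0})
    for y, p, g in zip(y_true, y_pred, groups):
        c = counts[g]
        c["n"] += 1
        c["sel"] += p
        if y == 1:
            c["pos"] += 1
            c["tp"] += p
        else:
            c["neg"] += 1
            c["fp"] += p
    return {
        g: {
            "selection_rate": c["sel"] / c["n"],
            "tpr": c["tp"] / c["pos"],
            "fpr": c["fp"] / c["neg"],
        }
        for g, c in counts.items()
    }


def demographic_parity_ratio(rates):
    """Lowest group selection rate divided by the highest."""
    sel = [r["selection_rate"] for r in rates.values()]
    return min(sel) / max(sel)


def equalised_odds_gaps(rates):
    """Largest between-group differences in TPR and in FPR."""
    tpr = [r["tpr"] for r in rates.values()]
    fpr = [r["fpr"] for r in rates.values()]
    return max(tpr) - min(tpr), max(fpr) - min(fpr)


# Toy data: eight applicants in each of two groups.
y_true = [1, 1, 1, 1, 0, 0, 0, 0] * 2
y_pred = [1, 1, 1, 0, 1, 0, 0, 0] + [1, 1, 0, 0, 0, 0, 0, 0]
groups = ["A"] * 8 + ["B"] * 8

rates = group_rates(y_true, y_pred, groups)
print(demographic_parity_ratio(rates))  # 0.5
print(equalised_odds_gaps(rates))       # (0.25, 0.25)
```

The same predictions can look very different depending on which metric is reported, which is why the choice and its justification belong in the documentation.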

Test for intersectionality. The most severe fairness failures often occur at the intersection of multiple protected characteristics — a model that is unbiased across age groups, unbiased across gender groups, but systematically worse for older women. Intersectional testing requires sufficient sample size in intersectional subgroups, which is a dataset design challenge as much as a testing challenge. Document the intersectional groups you tested, the groups you could not test due to insufficient sample size, and how the untested combinations represent a residual risk in your RMS.
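Intersectional testing and the record of untestable groups can be produced in one pass. In this sketch the attribute names, the toy rows, and the minimum group size of 30 are all hypothetical; the real minimum should come from a statistical power argument documented in the test plan.

```python
from collections import defaultdict

MIN_GROUP_SIZE = 30  # hypothetical minimum; justify the real value statistically


def intersectional_accuracy(rows, attributes, min_size=MIN_GROUP_SIZE):
    """Accuracy per intersectional subgroup, plus the subgroups too small to test."""
    buckets = defaultdict(list)
    for row in rows:
        key = tuple(row[a] for a in attributes)
        buckets[key].append(row["y_true"] == row["y_pred"])
    tested, untested = {}, {}
    for key, hits in buckets.items():
        if len(hits) >= min_size:
            tested[key] = sum(hits) / len(hits)
        else:
            untested[key] = len(hits)  # record the sample size as residual-risk evidence
    return tested, untested


# Toy evaluation rows.
rows = (
    [{"age_band": "65+", "sex": "F", "y_true": 1, "y_pred": 1}] * 30
    + [{"age_band": "65+", "sex": "F", "y_true": 1, "y_pred": 0}] * 10
    + [{"age_band": "18-64", "sex": "M", "y_true": 1, "y_pred": 1}] * 40
    + [{"age_band": "65+", "sex": "M", "y_true": 0, "y_pred": 0}] * 5
)

tested, untested = intersectional_accuracy(rows, ("age_band", "sex"))
print(tested)    # {('65+', 'F'): 0.75, ('18-64', 'M'): 1.0}
print(untested)  # {('65+', 'M'): 5}
```

The `untested` output maps directly onto the residual risk entries the RMS needs for combinations that could not be evaluated.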

Robustness and Adversarial Testing

Art.15 (accuracy, robustness and cybersecurity) works in conjunction with Art.9 to require that high-risk AI systems maintain their performance under adverse conditions. Robustness testing is the mechanism by which you establish that this requirement is met.

Input perturbation testing. The system must be tested against realistic variations in input quality that will occur in deployment. For a medical imaging system, this means testing against images from different scanner manufacturers, different acquisition protocols, and different compression artefact levels. For an NLP-based document processing system, this means testing against OCR errors, handwriting variability, different language registers, and missing fields. Each perturbation type represents a foreseeable condition in which the system must continue to perform above its minimum performance threshold.
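A perturbation test harness can be quite small. The sketch below applies named perturbations to every test record and flags any condition under which accuracy falls below a minimum. The toy model, the two perturbations, and the 0.9 minimum are all hypothetical stand-ins for the real system and its documented thresholds.

```python
def toy_model(record: dict) -> int:
    """Stand-in for the real system: approve when income covers 3x the requested amount."""
    income = record.get("income")
    if income is None:
        return 0  # conservative default when the field is missing
    return int(income >= 3 * record["amount"])


def drop_income(record):
    """Simulates an application that arrives without the income field."""
    return {k: v for k, v in record.items() if k != "income"}


def income_in_thousands(record):
    """Simulates an operator who reports income in thousands rather than units."""
    return {**record, "income": record["income"] / 1000}


def accuracy(model, records, labels):
    return sum(model(r) == y for r, y in zip(records, labels)) / len(labels)


def perturbation_report(model, records, labels, perturbations, min_accuracy):
    """Accuracy per perturbation, plus the conditions that fall below the minimum."""
    report = {"baseline": accuracy(model, records, labels)}
    for name, fn in perturbations.items():
        report[name] = accuracy(model, [fn(r) for r in records], labels)
    failing = [name for name, acc in report.items() if acc < min_accuracy]
    return report, failing


records = [
    {"income": 90000, "amount": 20000},
    {"income": 30000, "amount": 20000},
    {"income": 60000, "amount": 10000},
    {"income": 20000, "amount": 10000},
]
labels = [1, 0, 1, 0]
perturbations = {"drop_income": drop_income, "income_in_thousands": income_in_thousands}

report, failing = perturbation_report(toy_model, records, labels, perturbations, min_accuracy=0.9)
print(report)   # {'baseline': 1.0, 'drop_income': 0.5, 'income_in_thousands': 0.5}
print(failing)  # ['drop_income', 'income_in_thousands']
```

Each entry in `failing` is a foreseeable condition under which the system drops below its minimum performance threshold and therefore needs a mitigation or a documented residual risk.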

Distribution shift testing. The operational data distribution will shift over time — demographic changes, regulatory changes, changes in operator behaviour. Robustness testing should include simulation of plausible distribution shifts and documentation of the performance degradation that results. The RMS should include thresholds at which performance degradation triggers a re-evaluation requirement.
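One widely used way to quantify distribution shift is the Population Stability Index (PSI), computed over a binned feature or score distribution. The regulation does not mandate PSI; it is shown here only as one possible measure, and the trigger value of 0.2 is a hypothetical setting that would need its own justification in the RMS.

```python
import math

PSI_TRIGGER = 0.2  # hypothetical re-evaluation trigger, not a regulatory value


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between two histograms over the same bins."""
    e_total, a_total = sum(expected_counts), sum(actual_counts)
    value = 0.0
    for e, a in zip(expected_counts, actual_counts):
        # Clamp empty bins to a small share so the logarithm stays defined.
        e_share = max(e / e_total, eps)
        a_share = max(a / a_total, eps)
        value += (a_share - e_share) * math.log(a_share / e_share)
    return value


baseline = [25, 25, 25, 25]  # bin counts in the pre-deployment test data
current = [10, 20, 30, 40]   # bin counts in a simulated shifted population

shift = psi(baseline, current)
needs_reevaluation = shift > PSI_TRIGGER
print(round(shift, 3), needs_reevaluation)  # 0.228 True
```

Running the same calculation on simulated shifts before deployment shows how much drift the system tolerates before its measured performance degrades.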

Adversarial input testing. For systems that process inputs that could be manipulated by adversarial actors — particularly systems used in law enforcement, border control, or fraud detection — adversarial testing against common attack patterns (adversarial examples, prompt injection for AI-assisted systems, feature manipulation) is expected. Art.15 explicitly names resistance to attacks as a requirement, and the RMS should document how adversarial testing coverage addresses it.

Graceful degradation testing. What does the system do when a required input is missing, when an upstream component fails, or when confidence scores fall below the threshold at which reliable output is possible? Testing for graceful degradation — not just best-case performance — is part of Art.9 compliance. The system's behaviour when it cannot provide a reliable answer is itself a safety and compliance question.
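The three failure conditions above translate naturally into a guard around the model call, which can then be tested directly. This is a sketch under stated assumptions: the required fields, the 0.7 confidence floor, and the names `Decision` and `guarded_predict` are hypothetical, and `scorer` stands in for whatever returns the model's positive-class probability.

```python
from dataclasses import dataclass
from typing import Callable, Optional

REQUIRED_FIELDS = ("income", "amount")  # hypothetical required inputs
MIN_CONFIDENCE = 0.7                    # hypothetical confidence floor


@dataclass
class Decision:
    outcome: Optional[int]  # None means the system abstained
    reason: str


def guarded_predict(record: dict, scorer: Callable[[dict], float]) -> Decision:
    """Return an automated decision only when inputs, scorer and confidence allow it."""
    missing = [f for f in REQUIRED_FIELDS if record.get(f) is None]
    if missing:
        return Decision(None, f"missing input: {', '.join(missing)}")
    try:
        p = scorer(record)
    except Exception as exc:  # upstream component failure
        return Decision(None, f"scorer failed: {type(exc).__name__}")
    confidence = max(p, 1 - p)
    if confidence < MIN_CONFIDENCE:
        return Decision(None, "low confidence, route to human review")
    return Decision(int(p >= 0.5), "automated")


def failing_scorer(record):
    raise TimeoutError("upstream timeout")


complete = {"income": 50000, "amount": 10000}
print(guarded_predict(complete, lambda r: 0.9))
print(guarded_predict({"amount": 10000}, lambda r: 0.9))
print(guarded_predict(complete, lambda r: 0.55))
print(guarded_predict(complete, failing_scorer))
```

Each of the three abstention paths is a test case in its own right, and the `reason` string gives the logging infrastructure something concrete to record.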

Documentation Requirements for Annex IV

Every material testing decision and every material test result must be documented in the technical documentation required by Art.11 and Annex IV. The testing documentation is not an addendum to the technical documentation — it is one of its primary components.

The test plan must precede execution. Annex IV requires documentation of the methodology used, which means the test plan must exist before the tests are run. A test plan created after results are already known is not methodology documentation — it is retroactive rationalisation. Conformity assessors have developed heuristics for detecting this.

Results must be traceable to versions. Each test result must be linked to the model version, the dataset version, the software version (including dependencies), and the infrastructure configuration under which it was obtained. The purpose is to make the result reproducible: if the assessor cannot reproduce the result given the versions documented, the result is not auditable.
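A result record that carries its own version identifiers might look like the following sketch. The field names and the example identifiers (`credit-model-1.4.2`, `a3f9c2d81b07`, `git:4e1b9aa`) are hypothetical; a real record would also reference a dependency lockfile hash and the infrastructure configuration, which are omitted here for brevity.

```python
import json
import platform
import sys
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class EvaluationRecord:
    test_id: str
    model_version: str
    dataset_version: str
    code_version: str
    metrics: dict
    # Environment details are captured automatically at creation time.
    python_version: str = field(default_factory=lambda: sys.version.split()[0])
    platform_info: str = field(default_factory=platform.platform)
    recorded_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

    def to_json(self) -> str:
        """Serialise with sorted keys so records diff cleanly in version control."""
        return json.dumps(asdict(self), sort_keys=True)


record = EvaluationRecord(
    test_id="T-010",
    model_version="credit-model-1.4.2",
    dataset_version="a3f9c2d81b07",
    code_version="git:4e1b9aa",
    metrics={"demographic_parity_ratio": 0.83},
)
print(record.to_json())
```

Making the record immutable and writing it at evaluation time, rather than assembling it later from memory, is what lets an assessor trust that the versions listed are the versions that were actually tested.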

Negative results must be retained. Testing documentation often includes only passing results. Art.9 compliance requires retaining the evidence of tests that failed — not to document failure as an endpoint, but to document that failure was identified, root cause was analysed, a remediation was implemented, and retesting confirmed improvement. A test record that contains only passing results looks like a test record that was selectively documented. That is a conformity risk.

Testing against each identified risk must be explicit. The link between the risk register and the test plan must be explicit. A conformity assessor reviewing your Annex IV documentation should be able to identify each risk in the risk register, identify the test or tests that address it, read the corresponding test results, and determine whether the risk was validated as acceptably mitigated. This traceability is not optional — it is what distinguishes an Art.9-compliant RMS from a collection of QA reports.

Practical Testing Checklist for August 2026 Compliance

The following checklist organises the Art.9 testing obligations into actionable items. Use it as a starting point for your pre-deployment review.

Pre-testing setup:

Risk register from risk identification phase is complete and version-controlled
Test plan documented before testing begins, with explicit metrics and thresholds for each test
Test dataset versioned, representativeness documented, independence from training data verified
Protected characteristics relevant to deployment context identified
Fairness metric selected with documented justification

Development-stage testing:

Bias testing on training data during model development (not only at final evaluation)
Hyperparameter search separated from final test set
Early detection of distributional mismatch between training and expected deployment data

Pre-deployment integration testing:

Full system testing (model + input pipeline + output safeguards + logging)
Worst-case scenario testing for all foreseeable edge cases from risk identification
Bias and fairness testing against all identified protected characteristics
Robustness testing against realistic input perturbations
Adversarial testing where applicable (systems subject to adversarial inputs)
Graceful degradation testing
All results documented with version traceability

Documentation:

Traceability matrix linking each risk to its test coverage
Test results, including failures, retained with version identifiers
Substantial modification assessment process documented for post-deployment changes

What Comes Next in This Series

Testing validates that risks identified in the risk identification phase are addressed by the system as built before deployment. But Art.9's requirement for a continuous, lifecycle-spanning Risk Management System does not end at market release.

Post 4 in this series covers continuous monitoring integration — how to set up the operational monitoring systems that detect when the system's real-world performance diverges from its pre-deployment test results, and how to connect monitoring outputs to the RMS in a way that satisfies Art.9's ongoing obligation. Post 5 covers the conformity assessment documentation package — the complete set of Annex IV deliverables that a notified body will review.

If you are evaluating EU-compliant infrastructure to host high-risk AI systems — with the audit trails, access controls, and data residency guarantees that Art.9 and Art.10 require — sota.io provides EU-sovereign cloud infrastructure specifically designed for regulated workloads.
