EU AI Act NCA Inspection: What Compliance Testing Evidence Inspectors Actually Examine (Art.15 + Annex IV)
Post #1480 in the sota.io EU AI Act Audit Readiness Series
When a National Competent Authority (NCA) arrives to inspect your high-risk AI system, they do not start with your risk management documentation or your EU Declaration of Conformity. They start with your testing records.
This surprises many providers. But it makes sense from an inspector's perspective: testing records are objective, timestamped, and hard to fake retroactively. If your accuracy benchmarks, robustness tests, and bias audits were done properly, they leave an audit trail. If they were not done — or were done without rigor — that gap becomes visible within the first hour of inspection.
This post covers what EU AI Act Article 15 and Annex IV require in terms of testing and validation evidence, what NCA inspectors specifically look for, and how to structure your test documentation to survive inspection.
Why Testing Evidence Is the NCA's First Priority
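Under Article 74 of the EU AI Act, market surveillance authorities (the NCAs that carry out inspections) have broad powers over high-risk AI systems placed on the EU market. They can request the technical documentation and the training, validation, and testing datasets. Access to source code is more restricted: it requires a reasoned request, and is available only where it is necessary to assess conformity and other testing or auditing procedures have been exhausted or proved insufficient. Critically, they can also request the testing and validation records that form the evidentiary basis for your conformity claim.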
Article 15 establishes the substantive requirements: high-risk AI systems must achieve appropriate levels of accuracy, robustness, and cybersecurity. But the proof that you met those levels lives in your testing documentation.
NCAs look at testing evidence first because it answers the threshold question: was this system actually validated before deployment? Everything else — risk management, human oversight, post-market monitoring — is downstream of that answer.
What Article 15 Requires (and What Inspectors Check)
Article 15 imposes three categories of technical requirements on high-risk AI systems:
1. Accuracy
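Your system must achieve "appropriate" accuracy for its intended purpose. The Regulation does not prescribe specific accuracy thresholds, which would be impossible given the diversity of use cases. It does require the levels of accuracy and the relevant accuracy metrics to be declared in the instructions for use (Article 15(3)). To back that declaration with evidence, you need to: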
- Define accuracy metrics for your system's specific task (F1 score, precision/recall, confusion matrix, AUC-ROC, or domain-specific KPIs)
- Run validation against independent test sets not used during training
- Document performance by subgroup — accuracy that averages well but degrades significantly for specific demographic groups, geographies, or edge cases is a compliance risk
- Establish and document accuracy thresholds that, if breached post-deployment, trigger your post-market monitoring response
What inspectors flag: accuracy metrics defined only on training data; no held-out test set; performance not broken down by population subgroups; thresholds undefined or defined post-hoc.
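As an illustration of the subgroup breakdown and threshold check described above, here is a minimal Python sketch. It assumes a binary classifier and a single F1 threshold; the function names, the `min_f1` value, and the sample data are hypothetical and not taken from the Regulation or from any particular toolkit.

```python
from collections import defaultdict

def precision_recall_f1(y_true, y_pred):
    """Binary precision, recall, and F1 for labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

def subgroup_report(y_true, y_pred, groups, min_f1):
    """Per-subgroup metrics plus a pass/fail flag against a threshold fixed in advance."""
    buckets = defaultdict(lambda: ([], []))
    for t, p, g in zip(y_true, y_pred, groups):
        buckets[g][0].append(t)
        buckets[g][1].append(p)
    report = {}
    for g, (ts, ps) in sorted(buckets.items()):
        metrics = precision_recall_f1(ts, ps)
        metrics["n"] = len(ts)
        metrics["meets_threshold"] = metrics["f1"] >= min_f1
        report[g] = metrics
    return report

# Hypothetical held-out test data (not used in training), two subgroups.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

report = subgroup_report(y_true, y_pred, groups, min_f1=0.8)
for group, metrics in report.items():
    print(group, metrics)
```

In this toy data the overall numbers look acceptable while subgroup B falls below the threshold, which is the pattern the subgroup breakdown is meant to expose.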
2. Robustness
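Article 15 requires high-risk AI systems to be as resilient as possible against errors, faults, and inconsistencies that arise within the system or in the environment in which it operates. Resilience against deliberate manipulation by third parties falls formally under the cybersecurity requirement of the same article, but in practice the two are usually tested and documented together.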
Your robustness testing documentation should cover:
- Input validation testing — what happens when the system receives malformed, out-of-distribution, or adversarial inputs
- Degradation testing — how performance changes as input quality degrades
- Consistency testing — does the system produce stable outputs for semantically equivalent inputs
- Failure mode documentation — known failure modes identified during testing, their triggers, and the risk controls applied to mitigate them
What inspectors flag: robustness testing limited to happy-path scenarios; no adversarial input testing; failure modes identified but not documented; consistency testing absent.
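Consistency testing in particular is easy to automate. The sketch below assumes you can group inputs that should be treated as semantically equivalent; `toy_predict` is a hypothetical stand-in for a real model call, and the grouping itself is something you would have to define for your own domain.

```python
def consistency_failures(predict, equivalent_inputs):
    """Return one entry per group of equivalent inputs that produced differing outputs."""
    failures = []
    for variants in equivalent_inputs:
        outputs = {variant: predict(variant) for variant in variants}
        if len(set(outputs.values())) > 1:
            failures.append(outputs)
    return failures

def toy_predict(text):
    # Hypothetical model stand-in: a case-sensitive keyword match,
    # deliberately fragile so the test has something to find.
    return "flag" if "URGENT" in text else "pass"

# Each inner list holds inputs that should receive the same output.
groups = [
    ["URGENT: reset password", "urgent: reset password"],
    ["hello there", "Hello there"],
]

failures = consistency_failures(toy_predict, groups)
for failure in failures:
    print("Inconsistent outputs:", failure)
```

Each failure found this way is a candidate entry for the failure modes register, together with the trigger that produced it.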
3. Cybersecurity
Article 15 explicitly extends to cybersecurity of the AI system itself — not just the infrastructure hosting it. This means:
- Threat model for the AI component — how could an attacker manipulate model inputs, outputs, or parameters to subvert the system's intended purpose?
- Adversarial robustness testing — at minimum, testing against gradient-based adversarial examples or input perturbation attacks for systems where adversarial manipulation is a realistic threat
- Access control for model inference endpoints — documentation that unauthorized parties cannot submit inputs or retrieve outputs at scale
- Supply chain verification — provenance and integrity checks for pre-trained components, third-party model weights, or API-based AI services integrated into your system
What inspectors flag: cybersecurity section of testing documentation absent entirely; threat model limited to infrastructure (network, server) without AI-specific attack vectors; pre-trained components used without provenance documentation.
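For the supply chain item, one simple form of integrity check is to record a SHA-256 digest for each third-party artifact when you first validate it, then re-verify the digest before deployment. The sketch below uses only the Python standard library; the file name and the shape of the verification record are assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large weight files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_component(path, expected_sha256):
    """Compare a file's digest with the digest recorded at validation time."""
    actual = sha256_of(path)
    return {
        "path": str(path),
        "expected": expected_sha256,
        "actual": actual,
        "match": actual == expected_sha256,
    }

# Demonstration with a temporary file standing in for model weights.
with tempfile.TemporaryDirectory() as tmp:
    weights = Path(tmp) / "weights.bin"  # hypothetical artifact name
    weights.write_bytes(b"example weights")
    recorded = hashlib.sha256(b"example weights").hexdigest()

    ok = verify_component(weights, recorded)

    weights.write_bytes(b"tampered weights")  # simulate a modified artifact
    bad = verify_component(weights, recorded)

print(ok["match"], bad["match"])
```

A digest match shows only that the artifact is unchanged since you recorded it. It says nothing about where the artifact came from, so provenance documentation is still needed alongside it.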
Annex IV: The Technical Documentation Checklist for Testing
Annex IV specifies the minimum content of the technical documentation required under Article 11. The testing-relevant sections include:
Section 2(g): The validation and testing procedures used, including information about the validation and testing data and their main characteristics; the metrics used to measure accuracy, robustness, and compliance with the other requirements in Chapter III, Section 2, as well as potentially discriminatory impacts; and test logs and test reports, dated and signed by the responsible persons
Section 2(h): The cybersecurity measures put in place
Section 3: Information about the monitoring, functioning, and control of the system, including its capabilities and limitations in performance and the degrees of accuracy for the specific persons or groups on which the system is intended to be used
Section 4: A description of the appropriateness of the performance metrics for the specific AI system
Section 9: A description of the system in place to evaluate performance in the post-market phase, including the post-market monitoring plan required by Article 72
NCAs use Annex IV as their inspection checklist. Each section above corresponds to a document (or set of documents) they will request. If a section is missing or thin, the inspection lengthens — because the inspector now needs to probe manually what the documentation should have explained.
The Five Testing Evidence Categories NCAs Examine
Based on the Article 15 requirements and Annex IV structure, NCA inspectors typically organize their review around five evidence categories:
Category 1: Pre-Deployment Validation Package
This is your core testing evidence — the results of all validation runs performed before the system was placed on the market.
Required contents:
- Test set composition (size, sampling strategy, distribution)
- Performance metrics by split (train / validation / test)
- Subgroup performance breakdown
- Date and version of the model tested
- Name of the team or individual who conducted the tests
- Comparison against the performance thresholds defined in your risk management documentation
Common gap: validation package exists but is tied to a version that is no longer deployed. NCAs expect documentation aligned to the current production version.
Category 2: Robustness and Edge Case Testing Records
This category documents testing beyond normal operating conditions.
Required contents:
- Out-of-distribution input test results
- Adversarial input test results (where applicable to threat model)
- Missing data / degraded input quality tests
- Known failure modes and their test triggers
- Remediation actions taken for failure modes discovered during testing
Common gap: robustness testing was performed but not documented separately from general validation — inspectors cannot distinguish edge case results from standard performance metrics.
Category 3: Bias and Fairness Audit Records
For any high-risk AI system that makes decisions affecting individuals, the EU AI Act's data governance requirements (Article 10, which calls for examining datasets for possible biases) and its accuracy requirements (Article 15) together create an obligation, implicit but real, to test for bias and to document the results.
Required contents:
- Protected characteristics analyzed (age, gender, geographic origin, disability where applicable)
- Fairness metrics used (demographic parity, equalized odds, calibration)
- Test results by subgroup
- Actions taken to address identified disparities
- Residual disparities acknowledged and risk-managed
Common gap: bias testing performed only on training data distribution, not on independent test sets; results not documented in a form that NCAs can evaluate.
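Two of the fairness metrics named above can be computed with a few lines of Python. This is a sketch under simplifying assumptions: binary labels and predictions, one protected attribute, and gaps reported as the largest difference between any two groups. The function names and sample data are hypothetical.

```python
def rate(values):
    """Share of positive (1) values; 0.0 for an empty list (a simplifying assumption)."""
    return sum(values) / len(values) if values else 0.0

def fairness_gaps(y_true, y_pred, groups):
    """Demographic parity and equalized odds differences across groups."""
    stats = {}
    for g in sorted(set(groups)):
        idx = [i for i, member in enumerate(groups) if member == g]
        stats[g] = {
            # Share of the group receiving a positive prediction.
            "selection_rate": rate([y_pred[i] for i in idx]),
            # True positive rate: positives predicted among actual positives.
            "tpr": rate([y_pred[i] for i in idx if y_true[i] == 1]),
            # False positive rate: positives predicted among actual negatives.
            "fpr": rate([y_pred[i] for i in idx if y_true[i] == 0]),
        }

    def gap(key):
        values = [s[key] for s in stats.values()]
        return max(values) - min(values)

    return {
        "by_group": stats,
        "demographic_parity_diff": gap("selection_rate"),
        "equalized_odds_diff": max(gap("tpr"), gap("fpr")),
    }

# Hypothetical independent test set with one protected attribute.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

result = fairness_gaps(y_true, y_pred, groups)
print(result)
```

Which metric is appropriate, and what gap is tolerable, depends on the use case. Those choices belong in the audit record next to the numbers.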
Category 4: Post-Deployment Monitoring Evidence
For systems deployed and operating, Article 15 requirements do not end at launch. NCAs will ask for evidence that you are monitoring performance in production.
Required contents:
- Monitoring KPIs defined and tied to pre-deployment thresholds
- Alert or trigger conditions for model performance degradation
- Record of monitoring reviews conducted since deployment
- Any model updates or retraining events triggered by monitoring findings
- Evidence that post-market monitoring findings feed back into risk management
Common gap: monitoring plan exists in documentation but no evidence it is actually operating; no records of monitoring reviews; thresholds defined but alerts never triggered (suspicious if system has been live more than six months).
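One way to show that monitoring is actually operating is to have every review produce a dated, machine-readable record, whether or not a threshold was breached. The sketch below assumes KPIs where higher is better and minimum thresholds carried over from pre-deployment validation; the field names and values are hypothetical.

```python
import json
from datetime import date

def monitoring_review(review_date, model_version, observed, thresholds):
    """Compare observed production KPIs with pre-deployment minimums and return a record."""
    breaches = {}
    for kpi, minimum in thresholds.items():
        value = observed.get(kpi)
        # A missing KPI is treated as a breach: absence of data is itself a finding.
        if value is None or value < minimum:
            breaches[kpi] = {"observed": value, "minimum": minimum}
    return {
        "review_date": review_date.isoformat(),
        "model_version": model_version,
        "observed": observed,
        "breaches": breaches,
        "action_required": bool(breaches),
    }

review = monitoring_review(
    review_date=date(2026, 6, 1),
    model_version="1.2.3",
    observed={"f1": 0.88, "recall": 0.76},
    thresholds={"f1": 0.85, "recall": 0.80},
)
print(json.dumps(review, indent=2))
```

Storing one such record per review period, including the periods with no breaches, gives you the record of monitoring reviews that inspectors ask for.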
Category 5: Third-Party and Pre-Trained Component Validation
If your system uses pre-trained model weights, API-based AI services, or embedded third-party components, NCAs will ask how you validated those components for your intended use case.
Required contents:
- Inventory of third-party AI components (models, APIs, libraries)
- Provenance and version documentation for each component
- Your own validation testing of each component in your deployment context
- Any known limitations of third-party components and how you mitigate them
- Contractual arrangements that give you access to the component's own testing documentation
Common gap: assumption that the third-party provider's own testing suffices — it does not. You bear conformity responsibility for the system as deployed, which means you must validate third-party components in your specific use case and document that validation.
How to Organize Testing Evidence for NCA Inspection
Structure matters as much as content. An inspector who cannot find the document they need within two minutes may record an adverse finding — not because the document does not exist, but because the documentation structure does not support audit review.
Recommended structure:
/compliance-evidence/
    /testing/
        /pre-deployment-validation/
            v1.2.3_validation_report_2026-05-15.pdf
            v1.2.3_test_set_metadata.json
            v1.2.3_subgroup_performance_breakdown.xlsx
        /robustness-testing/
            v1.2.3_adversarial_test_results_2026-05-10.pdf
            v1.2.3_failure_modes_register.xlsx
        /bias-fairness-audit/
            v1.2.3_fairness_audit_2026-05-12.pdf
            v1.2.3_subgroup_metrics.xlsx
        /post-deployment-monitoring/
            monitoring_plan_v1.2.pdf
            monitoring_review_2026-05-01.pdf
            monitoring_review_2026-06-01.pdf
        /third-party-validation/
            component_inventory.xlsx
            openai_gpt4_validation_our_usecase_2026-04-20.pdf
Version alignment: every document in /testing/ must reference the same model version currently in production. If you have updated your model since the documents were created, you need updated testing documentation for the current version.
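If filenames carry the model version as a prefix, as in the structure above, a version alignment check can be scripted. This sketch assumes a `vMAJOR.MINOR.PATCH_` prefix convention and ignores files that do not follow it; both the convention and the function name are assumptions for illustration.

```python
import re

# Matches a leading model-version prefix such as "v1.2.3_".
VERSION_PREFIX = re.compile(r"^v(\d+\.\d+\.\d+)_")

def stale_documents(filenames, production_version):
    """Return versioned filenames whose prefix differs from the production model version."""
    stale = []
    for name in filenames:
        match = VERSION_PREFIX.match(name)
        if match and match.group(1) != production_version:
            stale.append(name)
    return stale

# Hypothetical listing of evidence files.
evidence = [
    "v1.2.3_validation_report_2026-05-15.pdf",
    "v1.2.2_failure_modes_register.xlsx",
    "component_inventory.xlsx",
]

stale = stale_documents(evidence, production_version="1.2.3")
print(stale)
```

Files without a version prefix are not flagged by this check, so they still need a manual review of the version they reference internally.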
Timestamps and signatories: every test report should carry a timestamp and the identity of the team or individual who ran the test. Annex IV expects test logs and test reports to be dated and signed by the responsible persons, so anonymous or undated reports are unlikely to be accepted as evidence.
Immutability evidence: where possible, store testing records in a system that creates tamper-evident audit trails (content-addressed storage, signed commits, or dedicated compliance document management tools). NCAs increasingly ask for evidence that records were not modified after the fact.
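A hash chain is one simple way to make a record log tamper-evident: each entry stores the digest of the document and the hash of the previous entry, so any later modification breaks verification. The sketch below is a minimal standard-library illustration, not a substitute for a dedicated compliance tool; the entry fields and filenames are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry

def entry_digest(body):
    """Hash an entry's fields in a stable (sorted-key) JSON form."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

def append_record(chain, filename, content):
    """Append an entry that commits to the document bytes and to the previous entry."""
    body = {
        "filename": filename,
        "content_sha256": hashlib.sha256(content).hexdigest(),
        "prev_hash": chain[-1]["entry_hash"] if chain else GENESIS,
    }
    entry = dict(body, entry_hash=entry_digest(body))
    chain.append(entry)
    return entry

def verify_chain(chain):
    """Recompute every entry hash and check that each entry links to its predecessor."""
    prev_hash = GENESIS
    for entry in chain:
        body = {k: v for k, v in entry.items() if k != "entry_hash"}
        if entry["prev_hash"] != prev_hash or entry["entry_hash"] != entry_digest(body):
            return False
        prev_hash = entry["entry_hash"]
    return True

chain = []
append_record(chain, "v1.2.3_validation_report.pdf", b"report bytes")
append_record(chain, "v1.2.3_fairness_audit.pdf", b"audit bytes")
intact = verify_chain(chain)

chain[0]["content_sha256"] = "f" * 64  # simulate an after-the-fact edit
still_valid = verify_chain(chain)
print(intact, still_valid)
```

A chain kept in a single mutable file can be rewritten from scratch, so the latest entry hash should also be anchored somewhere outside your control, for example in a signed commit or a timestamping service.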
What Triggers an Extended Inspection
An NCA inspection that goes beyond its standard duration is almost always caused by one of three documentation failures:
1. Version mismatch — testing records apply to a previous model version, not the one currently deployed. The inspector must now determine whether the untested changes affected the system's compliance posture. This requires additional evidence gathering and potentially testing the current version during the inspection.
2. Coverage gaps — one or more of the five categories above is absent. The inspector must probe deeper into other documents to reconstruct what should have been explicitly documented. This is time-consuming for both parties.
3. Internal inconsistency — performance metrics in the validation report differ from metrics cited in the risk management documentation or the technical documentation summary. Inconsistency is a red flag that documentation was assembled after the fact rather than maintained continuously.
None of these triggers require bad faith on the provider's part. They arise from ordinary development practices that do not account for the audit requirements of Article 15 and Annex IV. The solution is to treat testing documentation as a compliance artifact from the beginning of the project, not a deliverable assembled before an audit.
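The third trigger, internal inconsistency, can be caught before an inspector finds it by comparing the metrics quoted in each document. The sketch below assumes the figures have already been extracted into a dictionary per document; the document names, metric names, and values are hypothetical.

```python
def metric_conflicts(documents, tolerance=1e-9):
    """Return metrics whose reported values differ between documents."""
    seen = {}
    for doc_name, metrics in documents.items():
        for metric, value in metrics.items():
            seen.setdefault(metric, {})[doc_name] = value
    return {
        metric: by_doc
        for metric, by_doc in seen.items()
        if max(by_doc.values()) - min(by_doc.values()) > tolerance
    }

# Hypothetical figures quoted in three compliance documents.
documents = {
    "validation_report": {"f1": 0.91, "recall": 0.88},
    "risk_management_file": {"f1": 0.93, "recall": 0.88},
    "technical_documentation_summary": {"f1": 0.91},
}

conflicts = metric_conflicts(documents)
for metric, by_doc in conflicts.items():
    print(metric, by_doc)
```

Running a check like this whenever a document is updated keeps the figures in sync continuously, which is the practice the paragraph above recommends.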
August 2, 2026: The Practical Deadline for Testing Evidence
August 2, 2026 is when the main provisions of the EU AI Act apply to high-risk AI systems under Annex III. For providers who have not yet structured their testing evidence for NCA inspection, this is the deadline for completing that work.
The practical order of operations:
1. Inventory your current testing evidence against the five categories above. Identify gaps.
2. Align documentation to current production version. If you have released model updates since your last validation, run and document updated tests.
3. Create the directory structure described above and populate it with existing and new documents.
4. Implement monitoring records if post-deployment monitoring documentation is absent.
5. Add third-party validation records for any AI components not validated in your deployment context.
Completing steps 1–5 before August 2, 2026 means that if an NCA initiates an inspection in the months following that date, your testing evidence is ready to produce within the inspection's initial documentation request window.
Summary: Testing Evidence Checklist
Before the August 2026 compliance deadline, verify that your testing documentation includes:
Pre-deployment validation
- Test set composition and size documented
- Performance metrics by split (train / validation / test)
- Subgroup performance breakdown
- Accuracy thresholds defined and applied to test results
- Version alignment between test report and current production deployment
Robustness testing
- Out-of-distribution and adversarial input test results
- Known failure modes register with test triggers
- Remediation actions documented for identified failure modes
Bias and fairness
- Protected characteristics tested with defined fairness metrics
- Subgroup results documented
- Residual disparities acknowledged in risk management
Post-deployment monitoring
- Monitoring KPIs tied to pre-deployment thresholds
- Records of monitoring reviews since deployment
- Model update triggers and evidence of updates applied
Third-party components
- Component inventory with provenance documentation
- Your own validation testing for each component
- Known limitations and mitigations documented
See Also
- EU AI Act Art.11 Technical Documentation: The 7 Gaps NCA Inspectors Find Most Often — Companion sprint post covering what Art.11 Annex IV documentation gaps trigger the same NCA scrutiny as inadequate testing evidence
- EU AI Act Art.15 Accuracy Testing: Automated Robustness & Cybersecurity CI/CD Gates for High-Risk AI 2026 — Implementation guide for the Art.15 accuracy and robustness requirements that generate the testing evidence this post describes
- EU AI Act NCA Inspection Response: Step-by-Step Playbook for High-Risk AI Providers — What happens after an NCA initiates contact: how to organize and deliver the testing evidence packet under Art.74
- EU AI Act Audit-Ready Evidence Packet: 30 Documents Every High-Risk AI Provider Must Have Before August 2026 — Sprint series opener mapping all 30 NCA-requested documents, including the testing evidence categories covered here
This is post 4 of 5 in the AUDIT-READINESS-SPRINT-2026 series. The finale (post 5) covers the complete NCA inspection readiness checklist that brings all five series posts together into a single operational self-assessment.
Deploying EU-compliant software on infrastructure that is itself outside the reach of foreign surveillance orders reduces your compliance surface area. sota.io is EU-native managed PaaS — no US parent company, no CLOUD Act exposure, hosted on Hetzner Germany.