2026-06-03·5 min read·sota.io Team

EU AI Act NCA Inspection: What Compliance Testing Evidence Inspectors Actually Examine (Art.15 + Annex IV)

Post #1480 in the sota.io EU AI Act Audit Readiness Series

When a National Competent Authority (NCA) arrives to inspect your high-risk AI system, they do not start with your risk management documentation or your EU Declaration of Conformity. They start with your testing records.

This surprises many providers. But it makes sense from an inspector's perspective: testing records are objective, timestamped, and hard to fake retroactively. If your accuracy benchmarks, robustness tests, and bias audits were done properly, they leave an audit trail. If they were not done — or were done without rigor — that gap becomes visible within the first hour of inspection.

This post covers what EU AI Act Article 15 and Annex IV require in terms of testing and validation evidence, what NCA inspectors specifically look for, and how to structure your test documentation to survive inspection.

Why Testing Evidence Is the NCA's First Priority

Under Article 74 of the EU AI Act, NCAs have broad powers to inspect high-risk AI systems placed on the EU market. They can request access to technical documentation, source code, training datasets, and — critically — the testing and validation records that form the evidentiary basis for your conformity claim.

Article 15 establishes the substantive requirements: high-risk AI systems must achieve appropriate levels of accuracy, robustness, and cybersecurity. But the proof that you met those levels lives in your testing documentation.

NCAs look at testing evidence first because it answers the threshold question: was this system actually validated before deployment? Everything else — risk management, human oversight, post-market monitoring — is downstream of that answer.

What Article 15 Requires (and What Inspectors Check)

Article 15 imposes three categories of technical requirements on high-risk AI systems:

1. Accuracy

Your system must achieve "appropriate" accuracy for its intended purpose. The Regulation does not prescribe specific accuracy thresholds, which would be impossible given the diversity of use cases, so demonstrating an appropriate level in practice means you need to:

Define accuracy metrics for your system's specific task (F1 score, precision/recall, confusion matrix, AUC-ROC, or domain-specific KPIs)
Run validation against independent test sets not used during training
Document performance by subgroup — accuracy that averages well but degrades significantly for specific demographic groups, geographies, or edge cases is a compliance risk
Establish and document accuracy thresholds that, if breached post-deployment, trigger your post-market monitoring response

What inspectors flag: accuracy metrics defined only on training data; no held-out test set; performance not broken down by population subgroups; thresholds undefined or defined post-hoc.
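
As a rough illustration of the subgroup breakdown and predefined thresholds described above, the sketch below computes accuracy per subgroup on a held-out set and compares it against thresholds fixed in advance. The threshold values, the group names, and the record layout are assumptions made for the example; the Regulation prescribes none of them.

```python
from collections import defaultdict

# Hypothetical thresholds; real values belong in your risk management documentation
# and should be fixed before the test run, not after it.
OVERALL_MIN_ACCURACY = 0.90
SUBGROUP_MIN_ACCURACY = 0.85


def accuracy_by_subgroup(records):
    """Return {subgroup: accuracy} for (subgroup, y_true, y_pred) records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for subgroup, y_true, y_pred in records:
        total[subgroup] += 1
        correct[subgroup] += int(y_true == y_pred)
    return {group: correct[group] / total[group] for group in total}


def threshold_report(records):
    """Compare overall and per-subgroup accuracy against the predefined thresholds."""
    if not records:
        raise ValueError("the held-out test set is empty")
    per_group = accuracy_by_subgroup(records)
    overall = sum(int(t == p) for _, t, p in records) / len(records)
    failing = sorted(g for g, acc in per_group.items() if acc < SUBGROUP_MIN_ACCURACY)
    return {
        "overall_accuracy": overall,
        "overall_pass": overall >= OVERALL_MIN_ACCURACY,
        "subgroup_accuracy": per_group,
        "subgroups_below_threshold": failing,
    }


# Illustrative held-out records: (subgroup, true label, predicted label).
held_out = (
    [("group_a", 1, 1)] * 95 + [("group_a", 1, 0)] * 5
    + [("group_b", 1, 1)] * 16 + [("group_b", 1, 0)] * 4
)
report = threshold_report(held_out)
print(report)
```

In this toy data the overall figure clears its threshold while group_b does not, which is the pattern that an average-only report hides.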

2. Robustness

Article 15 requires high-risk AI systems to be resilient against errors, faults, and inconsistencies — whether arising from within the system or from intentional adversarial manipulation.

Your robustness testing documentation should cover:

Input validation testing — what happens when the system receives malformed, out-of-distribution, or adversarial inputs
Degradation testing — how performance changes as input quality degrades
Consistency testing — does the system produce stable outputs for semantically equivalent inputs
Failure mode documentation — known failure modes identified during testing, their triggers, and the risk controls applied to mitigate them

What inspectors flag: robustness testing limited to happy-path scenarios; no adversarial input testing; failure modes identified but not documented; consistency testing absent.
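
Consistency testing in particular lends itself to a small harness. The sketch below assumes you can supply groups of inputs that you consider semantically equivalent; the function names and the toy classifier are hypothetical stand-ins, not part of any real system.

```python
def consistency_rate(predict, variant_groups):
    """Return (share of groups with a single stable output, list of unstable groups)."""
    if not variant_groups:
        raise ValueError("need at least one group of variants")
    stable = 0
    unstable = []
    for group in variant_groups:
        # Outputs must be hashable (for example class labels) to be compared as a set.
        outputs = {predict(text) for text in group}
        if len(outputs) == 1:
            stable += 1
        else:
            unstable.append(group)
    return stable / len(variant_groups), unstable


def toy_predict(text):
    """Toy stand-in for a real model: flags text containing the exact token 'urgent'."""
    return "high" if "urgent" in text.split() else "normal"


groups = [
    ["urgent request", "request urgent"],   # word order changed
    ["urgent request", "URGENT request"],   # casing changed
    ["please review", "please  review"],    # whitespace changed
]
rate, unstable = consistency_rate(toy_predict, groups)
print(f"consistency rate: {rate:.2f}")
print("unstable groups:", unstable)
```

The unstable groups are candidates for the failure mode register: each one names a trigger (here, casing) that can be recorded together with the control applied.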

3. Cybersecurity

Article 15 explicitly extends to cybersecurity of the AI system itself — not just the infrastructure hosting it. This means:

Threat model for the AI component — how could an attacker manipulate model inputs, outputs, or parameters to subvert the system's intended purpose?
Adversarial robustness testing — at minimum, testing against gradient-based adversarial examples or input perturbation attacks for systems where adversarial manipulation is a realistic threat
Access control for model inference endpoints — documentation that unauthorized parties cannot submit inputs or retrieve outputs at scale
Supply chain verification — provenance and integrity checks for pre-trained components, third-party model weights, or API-based AI services integrated into your system

What inspectors flag: cybersecurity section of testing documentation absent entirely; threat model limited to infrastructure (network, server) without AI-specific attack vectors; pre-trained components used without provenance documentation.
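
For supply chain verification, one minimal integrity check is to pin a SHA-256 digest for each pre-trained artifact and re-verify it before use. The sketch below uses only the Python standard library; the file name and the idea of a pinned digest taken from your component inventory are assumptions for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large weight files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_component(path, expected_sha256):
    """Return a small record suitable for filing alongside the component inventory."""
    actual = sha256_of_file(path)
    return {
        "path": str(path),
        "expected": expected_sha256,
        "actual": actual,
        "match": actual == expected_sha256,
    }


# Demonstration with a throwaway file standing in for third-party weights.
with tempfile.TemporaryDirectory() as tmp:
    weights = Path(tmp) / "weights.bin"  # hypothetical artifact name
    weights.write_bytes(b"example weights")
    pinned = hashlib.sha256(b"example weights").hexdigest()

    ok = verify_component(weights, pinned)
    weights.write_bytes(b"tampered weights")
    bad = verify_component(weights, pinned)

print(ok["match"], bad["match"])
```

A matching digest only shows the artifact is the one you recorded; it says nothing about whether that artifact was trustworthy in the first place, so it complements provenance documentation rather than replacing it.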

Annex IV: The Technical Documentation Checklist for Testing

Annex IV specifies the minimum content of the technical documentation required under Article 11. The testing-relevant sections include:

Section 2(g): The validation and testing procedures used, including information about the validation and testing data and their main characteristics; the metrics used to measure accuracy, robustness, and compliance with the other requirements in Chapter III, Section 2, as well as potentially discriminatory impacts; and test logs and all test reports, dated and signed by the responsible persons

Section 2(h): The cybersecurity measures put in place

Section 3: Detailed information about the monitoring, functioning, and control of the AI system, including its capabilities and limitations in performance and the degrees of accuracy for the specific persons or groups on which the system is intended to be used

Section 4: A description of the appropriateness of the performance metrics for the specific AI system

Section 9: A detailed description of the system in place to evaluate the AI system's performance in the post-market phase in accordance with Article 72, including the post-market monitoring plan

NCAs use Annex IV as their inspection checklist. Each section above corresponds to a document (or set of documents) they will request. If a section is missing or thin, the inspection lengthens — because the inspector now needs to probe manually what the documentation should have explained.

The Five Testing Evidence Categories NCAs Examine

Based on the Article 15 requirements and Annex IV structure, NCA inspectors typically organize their review around five evidence categories:

Category 1: Pre-Deployment Validation Package

This is your core testing evidence — the results of all validation runs performed before the system was placed on the market.

Required contents:

Test set composition (size, sampling strategy, distribution)
Performance metrics by split (train / validation / test)
Subgroup performance breakdown
Date and version of the model tested
Name of the team or individual who conducted the tests
Comparison against the performance thresholds defined in your risk management documentation

Common gap: validation package exists but is tied to a version that is no longer deployed. NCAs expect documentation aligned to the current production version.

Category 2: Robustness and Edge Case Testing Records

This category documents testing beyond normal operating conditions.

Required contents:

Out-of-distribution input test results
Adversarial input test results (where applicable to threat model)
Missing data / degraded input quality tests
Known failure modes and their test triggers
Remediation actions taken for failure modes discovered during testing

Common gap: robustness testing was performed but not documented separately from general validation — inspectors cannot distinguish edge case results from standard performance metrics.

Category 3: Bias and Fairness Audit Records

For any high-risk AI system that makes decisions affecting individuals, the EU AI Act's data governance requirements (Article 10) and accuracy requirements (Article 15) together create an implicit bias testing obligation.

Required contents:

Protected characteristics analyzed (age, gender, geographic origin, disability where applicable)
Fairness metrics used (demographic parity, equalized odds, calibration)
Test results by subgroup
Actions taken to address identified disparities
Residual disparities acknowledged and risk-managed

Common gap: bias testing performed only on training data distribution, not on independent test sets; results not documented in a form that NCAs can evaluate.
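
Two of the fairness metrics named above can be computed with a few lines of plain Python. This is a sketch under assumptions: binary labels, two hypothetical groups, and the convention of reporting equalized odds as the larger of the true-positive-rate gap and the false-positive-rate gap (other conventions exist).

```python
def group_rates(records):
    """records: list of (y_true, y_pred) pairs with binary 0/1 labels."""
    positives = [pred for true, pred in records if true == 1]
    negatives = [pred for true, pred in records if true == 0]
    if not positives or not negatives:
        raise ValueError("each subgroup needs both positive and negative ground-truth cases")
    return {
        "selection_rate": sum(pred for _, pred in records) / len(records),
        "tpr": sum(positives) / len(positives),
        "fpr": sum(negatives) / len(negatives),
    }


def fairness_gaps(records_by_group):
    """Return max-min gaps across groups for the selection rate, TPR, and FPR."""
    per_group = {group: group_rates(recs) for group, recs in records_by_group.items()}

    def gap(key):
        values = [metrics[key] for metrics in per_group.values()]
        return max(values) - min(values)

    return {
        "demographic_parity_difference": gap("selection_rate"),
        "equalized_odds_difference": max(gap("tpr"), gap("fpr")),
        "per_group": per_group,
    }


# Illustrative independent test data, not real audit results.
test_records = {
    "group_a": [(1, 1)] * 4 + [(1, 0)] * 1 + [(0, 1)] * 1 + [(0, 0)] * 4,
    "group_b": [(1, 1)] * 3 + [(1, 0)] * 2 + [(0, 1)] * 1 + [(0, 0)] * 4,
}
gaps = fairness_gaps(test_records)
print(gaps)
```

Which gap size counts as acceptable is a risk management decision; the code only produces the numbers that the audit record then has to justify.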

Category 4: Post-Deployment Monitoring Evidence

For systems deployed and operating, Article 15 requirements do not end at launch. NCAs will ask for evidence that you are monitoring performance in production.

Required contents:

Monitoring KPIs defined and tied to pre-deployment thresholds
Alert or trigger conditions for model performance degradation
Record of monitoring reviews conducted since deployment
Any model updates or retraining events triggered by monitoring findings
Evidence that post-market monitoring findings feed back into risk management

Common gap: monitoring plan exists in documentation but there is no evidence it is actually operating; no records of monitoring reviews; thresholds defined but no alert ever triggered, which looks suspicious if the system has been live for more than six months.
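
One way to make monitoring reviews leave a record is to log every review against the pre-deployment threshold, whether or not it breaches. The sketch below assumes a single metric with a documented floor; the class name, field names, and values are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class MonitoringReview:
    review_date: date
    model_version: str
    metric: str
    observed: float
    threshold: float

    @property
    def breached(self) -> bool:
        # A breach means observed performance fell below the documented floor.
        return self.observed < self.threshold


def build_review_log(observations, thresholds, model_version):
    """Turn (date, metric, value) observations into review records tied to thresholds."""
    log = []
    for review_date, metric, observed in observations:
        if metric not in thresholds:
            raise KeyError(f"no pre-deployment threshold documented for {metric!r}")
        log.append(
            MonitoringReview(review_date, model_version, metric, observed, thresholds[metric])
        )
    return log


# Hypothetical threshold and production observations.
thresholds = {"f1": 0.85}
observations = [
    (date(2026, 5, 1), "f1", 0.88),
    (date(2026, 6, 1), "f1", 0.83),
]
log = build_review_log(observations, thresholds, model_version="1.2.3")
breaches = [review for review in log if review.breached]

for review in log:
    status = "BREACH" if review.breached else "ok"
    print(review.review_date.isoformat(), review.metric, review.observed, status)
```

Keeping the non-breaching entries matters as much as the breaches: they are the evidence that the monitoring plan was actually running.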

Category 5: Third-Party and Pre-Trained Component Validation

If your system uses pre-trained model weights, API-based AI services, or embedded third-party components, NCAs will ask how you validated those components for your intended use case.

Required contents:

Inventory of third-party AI components (models, APIs, libraries)
Provenance and version documentation for each component
Your own validation testing of each component in your deployment context
Any known limitations of third-party components and how you mitigate them
Contractual arrangements that give you access to the component's own testing documentation

Common gap: assumption that the third-party provider's own testing suffices — it does not. You bear conformity responsibility for the system as deployed, which means you must validate third-party components in your specific use case and document that validation.

How to Organize Testing Evidence for NCA Inspection

Structure matters as much as content. An inspector who cannot find the document they need within two minutes may record an adverse finding — not because the document does not exist, but because the documentation structure does not support audit review.

Recommended structure:

/compliance-evidence/
  /testing/
    /pre-deployment-validation/
      v1.2.3_validation_report_2026-05-15.pdf
      v1.2.3_test_set_metadata.json
      v1.2.3_subgroup_performance_breakdown.xlsx
    /robustness-testing/
      v1.2.3_adversarial_test_results_2026-05-10.pdf
      v1.2.3_failure_modes_register.xlsx
    /bias-fairness-audit/
      v1.2.3_fairness_audit_2026-05-12.pdf
      v1.2.3_subgroup_metrics.xlsx
    /post-deployment-monitoring/
      monitoring_plan_v1.2.pdf
      monitoring_review_2026-05-01.pdf
      monitoring_review_2026-06-01.pdf
    /third-party-validation/
      component_inventory.xlsx
      openai_gpt4_validation_our_usecase_2026-04-20.pdf

Version alignment: every document in /testing/ must reference the same model version currently in production. If you have updated your model since the documents were created, you need updated testing documentation for the current version.
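
Version alignment can be checked mechanically if file names follow the prefix convention shown in the tree above. The sketch below assumes that convention applies to the three version-specific folders and that the production version is known as a plain string; the stale and unversioned paths are invented for the example.

```python
import re
from pathlib import PurePosixPath

# Folders whose documents are expected to carry a model-version prefix.
VERSIONED_DIRS = {"pre-deployment-validation", "robustness-testing", "bias-fairness-audit"}
VERSION_PREFIX = re.compile(r"^v(\d+\.\d+\.\d+)_")


def misaligned_documents(paths, production_version):
    """Return (path, version found or None) for documents not matching production."""
    problems = []
    for raw in paths:
        path = PurePosixPath(raw)
        if path.parent.name not in VERSIONED_DIRS:
            continue  # monitoring and inventory files are not version-prefixed
        match = VERSION_PREFIX.match(path.name)
        found = match.group(1) if match else None
        if found != production_version:
            problems.append((raw, found))
    return problems


# Hypothetical listing: one aligned report, one stale report, one unversioned file.
listing = [
    "testing/pre-deployment-validation/v1.2.3_validation_report_2026-05-15.pdf",
    "testing/robustness-testing/v1.2.2_adversarial_test_results_2026-03-02.pdf",
    "testing/bias-fairness-audit/fairness_audit.pdf",
    "testing/post-deployment-monitoring/monitoring_plan_v1.2.pdf",
]
problems = misaligned_documents(listing, production_version="1.2.3")
for path, found in problems:
    print(f"{path}: found {found or 'no version prefix'}, expected 1.2.3")
```

A check like this fits naturally into a release pipeline, so that a new production version cannot ship while the evidence folder still points at the previous one.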

Timestamps and signatories: every test report should carry a timestamp and the identity of the team or individual who ran the test. Anonymous or undated reports are not acceptable under NCA inspection.

Immutability evidence: where possible, store testing records in a system that creates tamper-evident audit trails (content-addressed storage, signed commits, or dedicated compliance document management tools). NCAs increasingly ask for evidence that records were not modified after the fact.
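
As one possible shape for a tamper-evident trail, the sketch below builds a hash-chained manifest: each entry records a document's SHA-256 digest plus the hash of the previous entry, so changing any earlier document changes every later entry. The manifest layout is an assumption for illustration, not a format that any authority prescribes.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def chain_entry(prev_hash, filename, content):
    """Create one manifest entry that commits to the file and to the previous entry."""
    record = {
        "file": filename,
        "sha256": hashlib.sha256(content).hexdigest(),
        "prev": prev_hash,
    }
    serialized = json.dumps(record, sort_keys=True).encode("utf-8")
    return {**record, "entry_hash": hashlib.sha256(serialized).hexdigest()}


def build_manifest(files):
    """files: ordered list of (filename, bytes). Returns the chained manifest."""
    manifest, prev = [], GENESIS
    for name, content in files:
        entry = chain_entry(prev, name, content)
        manifest.append(entry)
        prev = entry["entry_hash"]
    return manifest


def verify_manifest(manifest, files):
    """Recompute the chain from the current files and compare it with the stored one."""
    return manifest == build_manifest(files)


# Hypothetical evidence files represented as in-memory bytes.
files = [("validation_report.pdf", b"report"), ("test_set_metadata.json", b"{}")]
manifest = build_manifest(files)

tampered = [("validation_report.pdf", b"report edited"), ("test_set_metadata.json", b"{}")]
print(verify_manifest(manifest, files), verify_manifest(manifest, tampered))
```

The chain only helps if the latest entry hash is anchored somewhere the record keeper cannot quietly rewrite, for example in a signed commit or a dedicated compliance system; otherwise the files and the manifest could be regenerated together.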

What Triggers an Extended Inspection

An NCA inspection that goes beyond its standard duration is almost always caused by one of three documentation failures:

1. Version mismatch — testing records apply to a previous model version, not the one currently deployed. The inspector must now determine whether the untested changes affected the system's compliance posture. This requires additional evidence gathering and potentially testing the current version during the inspection.

2. Coverage gaps — one or more of the five categories above is absent. The inspector must probe deeper into other documents to reconstruct what should have been explicitly documented. This is time-consuming for both parties.

3. Internal inconsistency — performance metrics in the validation report differ from metrics cited in the risk management documentation or the technical documentation summary. Inconsistency is a red flag that documentation was assembled after the fact rather than maintained continuously.

None of these triggers require bad faith on the provider's part. They arise from ordinary development practices that do not account for the audit requirements of Article 15 and Annex IV. The solution is to treat testing documentation as a compliance artifact from the beginning of the project, not a deliverable assembled before an audit.
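
The third trigger, internal inconsistency, is the easiest to catch before an inspector does. The sketch below assumes you can extract the headline metrics cited in each document into a dictionary; the document names, metric names, values, and tolerance are all hypothetical.

```python
def metric_discrepancies(sources, tolerance=0.0005):
    """sources: {document_name: {metric: value}}.

    Returns (metric, {document: value}) for every metric cited in more than one
    document with values that differ by more than the tolerance.
    """
    metrics = sorted({metric for values in sources.values() for metric in values})
    findings = []
    for metric in metrics:
        cited = {doc: vals[metric] for doc, vals in sources.items() if metric in vals}
        if len(cited) > 1 and max(cited.values()) - min(cited.values()) > tolerance:
            findings.append((metric, cited))
    return findings


# Illustrative figures as they might be cited across three documents.
cited_metrics = {
    "validation_report": {"f1": 0.912, "auc_roc": 0.950},
    "risk_management": {"f1": 0.930},
    "technical_documentation_summary": {"f1": 0.912, "auc_roc": 0.950},
}
findings = metric_discrepancies(cited_metrics)
for metric, cited in findings:
    print(metric, cited)
```

A discrepancy is not automatically an error, since documents may legitimately cite different model versions or test sets, but each one needs an explanation on file.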

August 2, 2026: The Practical Deadline for Testing Evidence

August 2, 2026 is when the main provisions of the EU AI Act apply to high-risk AI systems under Annex III. For providers who have not yet structured their testing evidence for NCA inspection, this is the deadline for completing that work.

The practical order of operations:

1. Inventory your current testing evidence against the five categories above. Identify gaps.
2. Align documentation to current production version. If you have released model updates since your last validation, run and document updated tests.
3. Create the directory structure described above and populate it with existing and new documents.
4. Implement monitoring records if post-deployment monitoring documentation is absent.
5. Add third-party validation records for any AI components not validated in your deployment context.

Completing steps 1–5 before August 2, 2026 means that if an NCA initiates an inspection in the months following that date, your testing evidence is ready to produce within the inspection's initial documentation request window.

Summary: Testing Evidence Checklist

Before the August 2026 compliance deadline, verify that your testing documentation includes:

Pre-deployment validation

Test set composition and size documented
Performance metrics by split (train / validation / test)
Subgroup performance breakdown
Accuracy thresholds defined and applied to test results
Version alignment between test report and current production deployment

Robustness testing

Out-of-distribution and adversarial input test results
Known failure modes register with test triggers
Remediation actions documented for identified failure modes

Bias and fairness

Protected characteristics tested with defined fairness metrics
Subgroup results documented
Residual disparities acknowledged in risk management

Post-deployment monitoring

Monitoring KPIs tied to pre-deployment thresholds
Records of monitoring reviews since deployment
Model update triggers and evidence of updates applied

Third-party components

Component inventory with provenance documentation
Your own validation testing for each component
Known limitations and mitigations documented

