High-Risk AI Testing & Evaluation Tools: EU Compliance 2026 Guide
Post #3 in the sota.io EU AI Act Omnibus 2026 Series
Your AI hiring system just rejected 40% of applicants from a specific postal code. Your medical diagnostic model has a 12% higher error rate for patients over 65. Your credit scoring algorithm is being audited by the German Federal Financial Supervisory Authority (BaFin). These are not theoretical scenarios — they are the compliance failures that the EU AI Act's high-risk provisions are designed to prevent.
Under the EU AI Act, Annex III covers fourteen categories of high-risk AI systems: from biometric identification to critical infrastructure, employment, education, credit scoring, law enforcement, and border management. If your system falls into any of these categories, Article 9 (Risk Management System), Article 10 (Data Governance), and Article 15 (Accuracy, Robustness, Cybersecurity) impose mandatory testing and evaluation obligations that went into effect for most Annex III applications on 2 August 2025.
The EU AI Act Omnibus 2026 — currently in final trilogue and expected to pass July 2026 — extends these obligations to a broader set of operators, raises the SME exemption threshold to 750 employees (down from 250), and adds new GPAI testing requirements. If you thought your 300-person company was exempt, think again.
This guide evaluates the leading AI testing and evaluation tools against the specific requirements of EU AI Act Articles 9, 10, and 15, and identifies which tools give you defensible documentation for notified body conformity assessment.
What the EU AI Act Actually Requires
Article 9: Risk Management System
Art.9 is the most operationally demanding provision. It requires a continuous risk management system throughout the entire lifecycle of a high-risk AI system — not a one-time certification but an ongoing process. The system must:
- Identify and analyse known and foreseeable risks
- Evaluate risks arising from the intended use and foreseeable misuse
- Identify residual risks after risk mitigation measures
- Adopt appropriate risk management measures
Critically, Art.9(2) requires testing that produces "adequate" performance metrics across the relevant population groups. This means your testing infrastructure must demonstrate that performance is consistent across demographic groups — age, gender, ethnicity, disability status — and that the system does not disproportionately harm any group.
What this means for tools: You need tools that can run structured bias audits, demographic parity tests, and adverse impact analyses — and produce documented evidence suitable for a notified body audit.
Article 10: Data Governance
Art.10 requires that training, validation, and test data sets meet specific quality standards:
- Relevant, representative, free of errors, complete
- Have regard to the characteristics of the intended purpose
- Account for geographic, behavioural, and functional settings
- Free from discriminatory effects
Art.10(5) permits testing with special category data (health, ethnicity, political views) "to the extent strictly necessary for the purpose of detecting and correcting bias." This is the narrow exception that enables demographic bias testing — but it requires data minimisation documentation.
What this means for tools: Your testing pipeline must track data lineage, document data quality metrics, and demonstrate that protected attributes are only used for bias detection, not model training.
Article 15: Accuracy, Robustness, Cybersecurity
Art.15 requires that high-risk AI systems "achieve an appropriate level of accuracy, robustness, and cybersecurity" and maintain those levels consistently. This includes:
- Stated accuracy metrics in the technical documentation
- Robustness against adversarial attacks (Art.15(3))
- Resilience to errors, faults, inconsistencies (Art.15(4))
- Fallback mechanisms
The adversarial robustness requirement is particularly significant for LLM-based high-risk applications. If you're using a language model for medical documentation, CV screening, or legal reasoning, you need to demonstrate that it cannot be manipulated through prompt injection, adversarial inputs, or evasion attacks.
The Testing Tool Landscape
We evaluated eight tools against four dimensions:
- Art.9 Coverage — Can it produce documented risk management evidence?
- Art.10 Coverage — Does it address bias, data quality, and demographic parity?
- Art.15 Coverage — Does it test adversarial robustness?
- EU Sovereignty Score — Is data processing within the EU? Is there CLOUD Act exposure?
Garak (NVIDIA, Open Source)
What it is: Garak (Generative AI Red-teaming and Assessment Kit) is NVIDIA's open-source LLM vulnerability scanner. Released in 2023, it runs automated probes against language models to detect failure modes including prompt injection, jailbreaks, hallucinations, bias, and toxic content generation.
Art.9 Coverage: Moderate. Garak runs structured probe batteries that can document specific failure modes. Its atkgen, dan, promptinject, and suffix probes directly test the adversarial manipulation scenarios Art.15(3) requires. The continuation and knownbadsignatures probes test for known harmful output patterns.
Art.10 Coverage: Limited. Garak has bias detectors but they are primarily oriented toward LLM output bias (toxic, demographic) rather than the structured statistical bias testing that Art.10 requires for Annex III systems. For employment AI or credit scoring, Garak alone is insufficient.
Art.15 Coverage: Strong. This is Garak's primary strength. The 200+ probes across adversarial, injection, encoding, and generation categories directly support Art.15(3) adversarial robustness documentation.
EU Sovereignty Score: 24/25. Garak runs locally on your infrastructure. No data leaves your environment. CLOUD Act exposure: zero. The only sovereignty concern is the model endpoint you're testing — if you're testing OpenAI GPT-4, your test inputs go to Microsoft/OpenAI servers. For EU-deployed models (Mistral, Ollama, custom), Garak is fully sovereign.
Verdict: Excellent Art.15 tooling, weak for Art.10. Best suited as one component of a broader EU AI Act compliance stack. MIT license, actively maintained by NVIDIA Research.
PyRIT (Microsoft, Open Source)
What it is: Python Risk Identification Tool for GenAI (PyRIT) is Microsoft's open-source red-teaming framework for generative AI applications. It supports multi-turn adversarial conversations, automated red-teaming via "orchestrators," and integration with Azure AI Studio.
Art.9 Coverage: Good. PyRIT's orchestrator pattern enables structured risk assessment workflows. The PromptSendingOrchestrator and RedTeamingOrchestrator can be used to systematically document risk scenarios in a format suitable for Art.9 risk registers.
Art.10 Coverage: Moderate. PyRIT includes bias and harm scoring via its scoring engine, but the integration with Azure AI Content Safety creates a data sovereignty problem.
Art.15 Coverage: Strong. PyRIT was explicitly designed for adversarial robustness testing including prompt injection, jailbreaks, and multi-turn evasion attacks.
EU Sovereignty Score: 11/25. Here is the critical problem for EU compliance: PyRIT's default configuration routes scoring through Azure AI Content Safety, which processes data in Microsoft's cloud infrastructure. Under the US CLOUD Act (2018), Microsoft is required to produce stored communications and data to US law enforcement regardless of where those servers are physically located.
For Annex III systems processing special categories of personal data (employment decisions, credit assessments, medical AI), this is a structural compliance risk. You can configure PyRIT to use local scoring models, but this requires significant additional engineering — the documentation and default examples all assume Azure integration.
Verdict: Powerful tool built primarily for the Azure ecosystem. EU compliance teams must audit every component for CLOUD Act exposure before using in Annex III testing pipelines. Proceed with caution.
Fraunhofer AI Red Team Methodology
What it is: Not a SaaS tool but a structured methodology developed by Fraunhofer IAIS (Institute for Intelligent Analysis and Information Systems, Sankt Augustin, Germany). Fraunhofer offers AI auditing services, consulting, and a published methodology framework aligned with EU AI Act requirements.
Art.9 Coverage: Excellent. Fraunhofer IAIS developed the AI Testing Initiative in collaboration with the German Federal Office for Information Security (BSI). Their methodology maps directly to Art.9 risk categories and produces audit-ready documentation.
Art.10 Coverage: Excellent. Fraunhofer IAIS has published extensive research on algorithmic bias, fairness metrics, and demographic impact assessment. Their tools and methodologies address the specific statistical requirements of Art.10 for employment, credit, and healthcare AI.
Art.15 Coverage: Excellent. Fraunhofer IAIS has adversarial robustness testing capabilities including membership inference attacks, model inversion attacks, and adversarial example generation.
EU Sovereignty Score: 25/25. German public research institution. All data processing occurs in Germany under German data protection law. Zero CLOUD Act exposure. Funded by the German Federal Ministry of Education and Research. This is as sovereign as it gets.
The catch: Fraunhofer IAIS provides consulting services, not self-service software. Engagement typically requires a formal research collaboration or commercial contract. For a 300-person SaaS company, the engagement model may be impractical — though they do offer structured audit packages for SMEs.
Verdict: The gold standard for EU AI Act compliance from an EU sovereignty perspective. The methodology is reference-quality. Consider using it as a framework even if you implement tools yourself.
IBM OpenPages (IBM, Commercial)
What it is: IBM OpenPages is IBM's governance, risk, and compliance (GRC) platform with an AI Risk module. It provides workflow management for AI risk assessment, documentation of AI system inventories, and integration with IBM Watson-based monitoring.
Art.9 Coverage: Strong. OpenPages was explicitly updated in 2024-2025 to support EU AI Act Art.9 risk management workflows. The AI Inventory module can track high-risk AI system classifications, and the Risk Assessment module generates documentation suitable for conformity assessment submissions.
Art.10 Coverage: Moderate. OpenPages manages the documentation and workflow around data quality but does not itself perform the technical tests. Integration with IBM OpenScale (now Watson OpenScale / IBM AI Fairness 360) provides the actual bias testing capability.
Art.15 Coverage: Limited as standalone. Requires integration with IBM Security Verify or external red-teaming tools.
EU Sovereignty Score: 14/25. IBM operates EU data centers (Frankfurt, Amsterdam). However, IBM Corporation is a US entity subject to the CLOUD Act. IBM's EU-specific contracts include data processing agreements and sovereignty commitments, but the structural legal exposure remains. For the most sensitive Annex III applications (biometric, healthcare, law enforcement), this creates the same CLOUD Act risk as Microsoft Azure.
IBM does offer "IBM Cloud Sovereign Controls" with EU-specific governance, but this is an enterprise add-on with significant cost and complexity.
Verdict: Best-in-class for GRC workflow management and documentation. The CLOUD Act exposure is manageable with proper contractual controls for most Annex III categories but remains a legal risk for the most sensitive applications. Pricing starts at €120,000/year for enterprise deployments.
AIShield (Bosch, EU-Native Commercial)
What it is: AIShield is a commercial AI security and robustness testing platform developed by Bosch Global Software Technologies (Bengaluru, India) with EU operations through Bosch Group. AIShield focuses on AI model hardening, vulnerability assessment, and adversarial robustness testing.
Art.9 Coverage: Good. AIShield provides structured vulnerability assessments that map to Art.9 risk categories. Its "AI Guardian" product includes risk scoring and documentation suitable for conformity assessment.
Art.10 Coverage: Moderate. AIShield includes fairness testing modules but the depth is less than dedicated bias testing frameworks.
Art.15 Coverage: Strong. AIShield's core product is adversarial robustness testing — model extraction attacks, membership inference, adversarial examples, and model poisoning detection.
EU Sovereignty Score: 18/25. Bosch is a German company with EU data sovereignty commitments. AIShield processes data in Bosch's EU cloud infrastructure. However, Bosch Global Software Technologies is headquartered in India, and the ownership structure creates some complexity for strict EU sovereignty requirements. For most Annex III applications, the Bosch EU infrastructure provides adequate sovereignty. For law enforcement or national security AI applications, additional legal review is warranted.
Verdict: A practical EU-leaning alternative to US cloud tooling for AI security testing. The Bosch enterprise backing provides the SLA and support model that many enterprises require.
IBM AI Fairness 360 (Open Source)
What it is: AIF360 is IBM's open-source Python library for detecting, understanding, and mitigating algorithmic bias. Originally developed for the NIST AI Risk Management Framework, it has been widely adopted for EU AI Act Art.10 compliance work.
Art.9 Coverage: Moderate. AIF360 focuses on bias detection and mitigation, not comprehensive risk management. It produces statistical evidence but not complete Art.9 risk documentation.
Art.10 Coverage: Excellent. AIF360 implements over 70 fairness metrics and 10 bias mitigation algorithms. For the statistical requirements of Art.10 — demographic parity, equal opportunity, disparate impact — AIF360 is the most complete open-source tool available.
Art.15 Coverage: Limited. AIF360 focuses on fairness, not adversarial robustness.
EU Sovereignty Score: 25/25. Open source, runs entirely on your infrastructure. Zero CLOUD Act exposure.
Verdict: The Art.10 complement to Garak's Art.15 strength. Used together, these two open-source tools address 80% of the technical testing requirements for most Annex III applications.
DeepEval (Confident AI, Open Source + Commercial)
What it is: DeepEval is an open-source LLM evaluation framework for testing LLM applications in CI/CD pipelines. It supports metrics for hallucination, faithfulness, answer relevancy, toxicity, bias, and custom evaluation criteria.
Art.9 Coverage: Moderate. DeepEval's structured evaluation pipelines can be integrated into continuous risk monitoring workflows. Its "ragas"-compatible interface allows integration with retrieval-augmented generation (RAG) system evaluation.
Art.10 Coverage: Good. DeepEval includes bias detection and toxicity scoring, though the depth is primarily focused on LLM output quality rather than statistical fairness in the Art.10 sense.
Art.15 Coverage: Moderate. Hallucination and faithfulness metrics contribute to Art.15 accuracy requirements. Adversarial testing is limited compared to Garak.
EU Sovereignty Score: 22/25. Can run entirely locally. The cloud dashboard (Confident AI) routes data to US infrastructure — disable it for EU compliance use cases.
Verdict: Excellent for LLM-based applications needing continuous evaluation in CI/CD pipelines. Best paired with Garak for adversarial testing and AIF360 for bias analysis.
Tool Comparison Matrix
| Tool | Art.9 Risk Mgmt | Art.10 Bias/Data | Art.15 Robustness | EU Sovereignty | Price |
|---|---|---|---|---|---|
| Garak (NVIDIA OSS) | ⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐⭐ | 24/25 | Free |
| PyRIT (Microsoft OSS) | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ | 11/25 | Free (Azure costs extra) |
| Fraunhofer IAIS | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | 25/25 | Consulting (€€€) |
| IBM OpenPages | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ | 14/25 | €120K+/year |
| AIShield (Bosch) | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ | 18/25 | €15K-80K/year |
| AIF360 (IBM OSS) | ⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐ | 25/25 | Free |
| DeepEval (Confident AI) | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ | 22/25 | Free (OSS) / €299+/mo |
The EU-Native Stack Recommendation
For most Annex III high-risk AI systems operated by EU companies, the optimal testing stack is:
Tier 1: Fully EU-Sovereign (SME/Startup)
Garak + AIF360 + DeepEval (local mode)
Cost: €0 in licensing, ~2-3 engineer-weeks to integrate Art.9 coverage: Adequate with documentation templates Art.10 coverage: Strong (AIF360) Art.15 coverage: Strong (Garak)
This stack covers the technical testing requirements. You will still need a legal/compliance team to produce the conformity assessment documentation, technical documentation, and post-market monitoring plan.
Tier 2: Enterprise with Sovereignty Requirements
AIShield + AIF360 + Garak + Internal GRC system
Cost: €15K-40K/year (AIShield) + open source tooling Adds: Structured risk documentation workflow, enterprise SLA, EU data sovereignty contracts Suitable for: Mid-market companies (250-2000 employees) in employment, credit, healthcare AI
Tier 3: Regulated/Critical Infrastructure
Fraunhofer IAIS engagement + IBM OpenPages (EU deployment) + Custom testing pipeline
Cost: €100K-500K+ (Fraunhofer) + €120K+ (OpenPages) Adds: Defensible third-party audit trail for notified body assessment, full methodology documentation, regulatory expertise Suitable for: Banking AI, national critical infrastructure, biometric systems under Annex III point 1
CLOUD Act: The Hidden Compliance Risk
The gap between "testing tool" and "EU-compliant testing tool" often comes down to CLOUD Act exposure. PyRIT routes through Azure. OpenPages GRC workflows may involve IBM US cloud infrastructure. Even open-source tools like Garak can create exposure if you're testing models hosted on US platforms.
The CLOUD Act risk is not hypothetical for Annex III systems. If you're running an employment screening AI that processes thousands of CVs, and your testing infrastructure routes test inputs (even synthetic ones) through a US cloud provider, you have a potential legal exposure under:
- GDPR Art.44 (restrictions on international data transfers)
- EU AI Act Art.10(2)(e) (data governance)
- NIS2 Art.21 (security of network and information systems)
Practical rule: For Annex III systems processing personal data, your testing pipeline should not touch US cloud infrastructure. Garak + AIF360 + DeepEval (local) achieves this. If you use PyRIT, configure it to use locally-hosted scoring models.
What Changes Under the Omnibus 2026
The EU AI Act Omnibus — expected to pass trilogue in July 2026 — makes three changes relevant to testing requirements:
1. Extended GPAI Testing Obligations. General-purpose AI models with systemic risk (computational threshold reduced from 10^25 to 10^23 FLOPs in the Omnibus draft) must now undergo structured red-teaming. This means models like Mistral Large or Command R+ may now require Garak-level adversarial testing if integrated into high-risk applications.
2. SME Testing Proportionality. The Omnibus introduces a "proportionate testing" principle for companies below 750 employees. Rather than the full Art.9 risk management system, smaller operators can document a "risk-proportionate approach" — essentially a lighter-weight version of the full conformity assessment. Practically, this means the Garak + AIF360 stack becomes a legally defensible approach for most SMEs.
3. Regulatory Sandbox Testing Data. The Omnibus expands the scope of regulatory sandboxes to allow testing with real personal data under supervised conditions. German AI sandboxes (BMBF/AI REGIO) and the French sandbox (ANS) will provide supervised environments for high-risk system testing — useful for healthcare and employment AI where synthetic data is insufficient.
Building Your Art.9 Test Documentation
A common mistake is confusing "running the tests" with "having EU AI Act documentation." Notified bodies need to see structured evidence, not Jupyter notebook outputs.
For each test run, document:
1. Test Scope
- Which model/version was tested
- Which Annex III category applies
- Which Art.9/10/15 requirements the test addresses
2. Test Methodology
- Tool used (Garak version, AIF360 version)
- Probe categories executed (for Garak: list the modules)
- Protected attributes tested (for AIF360: which demographics)
- Test data description (synthetic/real, size, provenance)
3. Results
- Pass/fail per probe category
- Bias metrics with confidence intervals
- Adversarial robustness score
4. Residual Risk Statement
- What risks were identified
- What mitigations were applied
- What residual risk remains (Art.9(1)(c))
5. Monitoring Plan
- How often tests will be repeated
- Trigger conditions for re-testing (model update, distribution shift)
- Who is responsible
This documentation structure maps directly to the conformity assessment technical documentation requirements in Annex IV of the EU AI Act.
The Conformity Assessment Gap
Most EU companies testing their AI systems with the tools listed above will pass a technical audit — but fail the conformity assessment because of documentation, not technology. The tools exist. The gap is the documented quality management system (Art.12, Art.17) that demonstrates ongoing compliance.
The EU AI Act's quality management requirement means your testing process itself must be documented, reviewed, and maintained. It's not enough to run Garak once before launch — you need a living document that shows:
- How testing is triggered (model updates, production drift, new deployment contexts)
- Who reviews results and who is responsible for risk decisions
- How test findings translate into model changes or deployment restrictions
- Audit trail of all test runs with timestamps and personnel
For most SMEs, this means implementing a lightweight internal AI Risk Register. A simple structured database (even a well-formatted Notion/Confluence space with version history) can satisfy this requirement if it covers the elements above.
Practical Next Steps
If your company operates a high-risk AI system under Annex III:
-
Classify your system. Map your AI application to the 14 Annex III categories. If you're unsure, the European AI Office published classification guidance in Q1 2025 (reference: EAIO-2025-032).
-
Assess testing infrastructure gaps. Run a Garak scan against your model. Run AIF360 analysis on your training data. These two free tools identify 80% of the issues that would fail a conformity assessment.
-
Document, document, document. Every test run generates evidence. Store it in a structured format from day one.
-
Check CLOUD Act exposure. Audit your testing pipeline for US cloud infrastructure. Restructure to run locally where possible.
-
Consider Fraunhofer IAIS engagement. For high-stakes Annex III applications (employment, credit, healthcare), a Fraunhofer IAIS audit engagement produces the most defensible third-party documentation.
The EU AI Act's high-risk provisions are not optional, and the August 2025 application date has passed. If you haven't started your conformity assessment, the time is now — not after your first regulatory inquiry.
sota.io helps EU SaaS and AI companies deploy compliant infrastructure. All processing in EU data centers with no CLOUD Act exposure. Start free →
EU-Native Hosting
Ready to move to EU-sovereign infrastructure?
sota.io is a German-hosted PaaS — no CLOUD Act exposure, no US jurisdiction, full GDPR compliance by design. Deploy your first app in minutes.