EU AI Act Automated Compliance Testing: Building a CI/CD Audit Pipeline for High-Risk AI Systems
Post #1463 in the sota.io EU AI Act Compliance Series
The August 2, 2026 EU AI Act deadline creates an uncomfortable reality for engineering teams: your high-risk AI system must carry a paper trail of compliance evidence before it goes to market, but modern ML workflows deploy models weekly or daily. If your compliance process is purely manual — a quarterly review session, a PDF checklist, an annual third-party audit — you will either ship non-compliant models or freeze deployments while lawyers catch up.
The solution is to shift compliance left: embed EU AI Act verification gates directly into your CI/CD pipeline so every model build either passes an automated audit or fails fast with actionable evidence. This post is the first in a five-part series on building that pipeline.
Why Manual Compliance Fails for AI Systems
Traditional software compliance works because code is deterministic. A security scan of yesterday's build is still valid today. A penetration test document from six months ago describes a known attack surface.
AI systems break this assumption. When you retrain a model on updated data, the bias profile changes. When you tune hyperparameters for accuracy, robustness against adversarial inputs may degrade. When you add a new feature to the input vector, the transparency obligation (Art.13, instructions for use) may require updated documentation.
EU AI Act Art.9(2) defines the risk management system as a continuous, iterative process that runs throughout the entire lifecycle of the high-risk AI system, not just at initial placement on the market. Art.9(6) requires testing to identify the most appropriate risk management measures, and Art.9(8) requires that testing happen, as appropriate, at any time during development and in any event before the system is placed on the market or put into service. The Act does not mandate CI/CD by name, but when every retrained model is a release candidate, automated gates are the practical way to satisfy this lifecycle framing.
The Five Mandatory Compliance Gates
A robust EU AI Act CI/CD pipeline needs five verification stages, each mapped to specific legal obligations:
Gate 1: Data Governance (Art.10)
Art.10 requires that training, validation, and test datasets be managed according to defined data governance practices. Your pipeline must verify:
- Dataset provenance: Does your training data registry record source, collection date, and consent basis?
- Representativeness check: Does the dataset cover the intended geographic deployment scope? A model trained exclusively on English-language inputs that will be deployed in multilingual EU markets may fail the Art.10(3) requirement for datasets that are "sufficiently representative."
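- Bias detection: Run statistical fairness metrics (demographic parity, equalized odds) across protected categories. Art.10(2)(f) requires examination of possible biases in the data that are likely to lead to discrimination, and Art.10(2)(g) requires measures to detect, prevent, and mitigate them. (Art.10(5) is the separate provision that permits processing special categories of personal data, under strict conditions, when that is necessary for bias detection and correction.)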
Tools: Evidently AI for data drift and bias reporting, Fairlearn for fairness metrics, Great Expectations for dataset validation.
# .github/workflows/compliance-gate-data.yml
- name: Data Governance Gate (Art.10)
  run: |
    python scripts/check_dataset_provenance.py --registry data-registry.json
    python scripts/run_bias_report.py --dataset "$DATASET_PATH" --output "reports/art10-bias-$(date +%Y%m%d).html"
    python scripts/check_representativeness.py --regions EU --threshold 0.85
The gate fails — and the build stops — if the bias report detects a demographic parity gap above your defined threshold, or if dataset provenance cannot be verified.
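To make the threshold check concrete, here is a minimal sketch of a demographic parity gap computed in plain Python. The function names and the 0.05 default are illustrative assumptions, not part of any library; in practice a tool such as Fairlearn would likely replace the hand-rolled metric.

```python
from collections import defaultdict


def demographic_parity_gap(predictions, groups):
    """Largest difference in positive-prediction rate between any two groups."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        if pred == 1:
            positives[group] += 1
    if not totals:
        raise ValueError("no predictions supplied")
    rates = [positives[g] / totals[g] for g in totals]
    return max(rates) - min(rates)


def bias_gate(predictions, groups, max_gap=0.05):
    """Return (passed, gap). max_gap is a hypothetical, project-defined threshold."""
    gap = demographic_parity_gap(predictions, groups)
    return gap <= max_gap, gap
```

A gate script would call `bias_gate` on the model's predictions for the evaluation set and exit non-zero when `passed` is false, writing the measured gap into the evidence record either way.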
Gate 2: Accuracy and Robustness (Art.15)
Art.15 requires that high-risk AI systems achieve appropriate levels of accuracy, robustness, and cybersecurity. Critically, Art.15(1) specifies that these properties must be maintained throughout the lifecycle at a consistent performance level.
Your CI/CD pipeline must define and enforce performance thresholds:
- Accuracy gate: The model must meet a minimum accuracy threshold on a held-out validation set. This threshold must be documented in your Annex IV technical documentation (Art.11), making it a compliance obligation, not just a product quality metric.
- Robustness gate: Test against distribution shift. If the model was trained on data from Q1 2025 but production data has drifted, the robustness requirement may already be violated.
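- Adversarial robustness: Art.15(5) requires high-risk AI systems to be resilient against attempts by unauthorised third parties to alter their use, outputs, or performance, and it names adversarial examples and data poisoning among the AI-specific vulnerabilities to address where appropriate. If your system operates in a high-risk Annex III category, include adversarial attack testing in your pipeline.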
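# scripts/run_accuracy_gate.py
import os
import sys
from datetime import datetime, timezone

# evaluate_model, write_evidence_record, model, the two datasets and
# MODEL_VERSION are project-specific and assumed to be imported or loaded above.

REQUIRED_ACCURACY = 0.92    # Must match Annex IV technical documentation
REQUIRED_ROBUSTNESS = 0.88  # Performance on held-out test set with distribution shift

accuracy = evaluate_model(model, validation_set)
robustness = evaluate_model(model, shifted_validation_set)
threshold_met = accuracy >= REQUIRED_ACCURACY and robustness >= REQUIRED_ROBUSTNESS

# Write compliance evidence for passing and failing runs alike
write_evidence_record({
    "gate": "art15-accuracy-robustness",
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "model_version": MODEL_VERSION,
    "accuracy": accuracy,
    "robustness": robustness,
    "threshold_met": threshold_met,
    "commit": os.getenv("GITHUB_SHA"),
})

# Exit explicitly instead of using assert: assertions are skipped under `python -O`
if accuracy < REQUIRED_ACCURACY:
    sys.exit(f"Art.15 FAIL: accuracy {accuracy:.3f} < {REQUIRED_ACCURACY}")
if robustness < REQUIRED_ROBUSTNESS:
    sys.exit(f"Art.15 FAIL: robustness {robustness:.3f} < {REQUIRED_ROBUSTNESS}")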
The critical output here is the evidence record — a JSON artifact that becomes part of your Art.11 technical documentation. Every gate must produce machine-readable evidence, not just a pass/fail signal.
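The gate script calls `write_evidence_record` without showing it. One possible shape, sketched under the assumption that evidence lands in a local `logs/` directory (matching the `logs/compliance-evidence-*.json` pattern used later in the workflow), is to serialise the record canonically and store a SHA-256 digest next to it so later tampering is detectable. The file layout and the `verify_evidence_record` helper are assumptions for illustration.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def _canonical(record):
    # Sorted keys and fixed separators make the digest reproducible.
    return json.dumps(record, sort_keys=True, separators=(",", ":"))


def write_evidence_record(record, out_dir="logs"):
    """Write one gate's evidence as JSON and return the file path."""
    stamped = dict(record)
    stamped.setdefault("timestamp", datetime.now(timezone.utc).isoformat())
    digest = hashlib.sha256(_canonical(stamped).encode("utf-8")).hexdigest()
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / f"compliance-evidence-{stamped['gate']}-{digest[:12]}.json"
    path.write_text(
        json.dumps({"record": stamped, "sha256": digest}, indent=2),
        encoding="utf-8",
    )
    return path


def verify_evidence_record(path):
    """Recompute the digest and compare it with the stored one."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    expected = hashlib.sha256(_canonical(data["record"]).encode("utf-8")).hexdigest()
    return expected == data["sha256"]
```

A digest stored in the same file only detects accidental or careless modification; anchoring the digest somewhere the writer cannot edit (for example the append-only store discussed under Gate 3) is what makes it useful as audit evidence.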
Gate 3: Logging and Record-Keeping (Art.12)
Art.12 requires that high-risk AI systems automatically log events with sufficient granularity to enable post-hoc reconstruction of system behaviour. This means your deployed model must be instrumented at build time, not just in production configuration.
In your CI/CD pipeline, this gate verifies that:
- The model artifact includes the required logging hooks
- The logging schema covers the minimum required fields (input hash, output, confidence score, timestamp, user identifier where applicable)
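- Logs are routed to an append-only store (Art.12 does not use the word immutability, but tamper-evident logs are far easier to defend as a reliable record of system behaviour)
- Log retention is configured to match your risk management system's post-market monitoring plan (Art.72 requires post-market monitoring after deployment, and Art.19 sets a floor of at least six months for logs under the provider's control)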
- name: Logging Configuration Gate (Art.12)
  run: |
    python scripts/verify_logging_hooks.py --model "$MODEL_PATH"
    python scripts/validate_log_schema.py --schema schemas/art12-log-schema.json
    python scripts/check_log_retention.py --target-days 365 --environment "$DEPLOY_ENV"
Failing this gate is a P0 blocker. A model without correct logging instrumentation cannot legally be deployed in the EU as a high-risk AI system.
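As an illustration of what the schema check might do, the sketch below validates a single log record against the minimum fields listed above. The field names (`input_hash`, `user_ref`, and so on) are assumptions chosen for this example; your actual `art12-log-schema.json` would define the real ones.

```python
from datetime import datetime

# Hypothetical minimum schema derived from the field list above.
REQUIRED_FIELDS = {
    "timestamp": str,
    "input_hash": str,
    "output": object,          # any JSON-serialisable value
    "confidence": (int, float),
}
OPTIONAL_FIELDS = {"user_ref": str}  # "where applicable"


def validate_log_record(record):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            problems.append(f"wrong type for field: {name}")
    for name, expected in OPTIONAL_FIELDS.items():
        if name in record and not isinstance(record[name], expected):
            problems.append(f"wrong type for field: {name}")
    timestamp = record.get("timestamp")
    if isinstance(timestamp, str):
        try:
            datetime.fromisoformat(timestamp)
        except ValueError:
            problems.append("timestamp is not ISO 8601")
    confidence = record.get("confidence")
    if isinstance(confidence, (int, float)) and not 0.0 <= confidence <= 1.0:
        problems.append("confidence outside [0, 1]")
    return problems
```

In CI, the gate could run the instrumented model against a handful of sample inputs, capture the emitted log records, and fail if any record returns a non-empty problem list.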
Gate 4: Human Oversight Mechanism (Art.14)
Art.14 requires that high-risk AI systems be designed so that natural persons can effectively oversee them. This includes the ability to interrupt, override, or stop the system. Your pipeline must verify these mechanisms exist and are functional.
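Human oversight verification in CI/CD is often overlooked because it feels like a UX concern rather than a testing concern. But Art.14(4) lists the capabilities that the people assigned to oversight must be given, and several of them translate into testable technical requirements: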
- The ability to identify anomalous situations (output confidence below threshold, unusual input distribution)
- Clear displays of system outputs for human review
- Override mechanisms that remain functional even under high load
Your pipeline should run an integration test that simulates a human override:
# tests/test_human_oversight.py
# ModelService, model, sample_input and adversarial_input are project-specific
# fixtures assumed to be imported or defined elsewhere in the test suite.


def test_override_mechanism():
    """Verify Art.14 human override is always callable."""
    model_service = ModelService()
    # Start processing
    future = model_service.predict_async(sample_input)
    # Override before completion; this must succeed
    result = model_service.override(reason="operator_test")
    assert result.success, "Art.14 FAIL: override mechanism did not respond"
    assert result.audit_logged, "Art.14 FAIL: override not written to audit log"


def test_anomaly_alerting():
    """Verify system flags low-confidence outputs for human review."""
    low_confidence_output = model.predict(adversarial_input)
    assert low_confidence_output.requires_human_review, "Art.14 FAIL: no human review flag on low-confidence output"
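The tests above assume a service object that exposes an override and a review flag. The following is a minimal in-memory stand-in, not a real serving framework, showing one way those two behaviours could be wired together. The class layout, the `score_fn` callable, and the 0.70 review threshold are all assumptions for illustration.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class Output:
    label: str
    confidence: float
    requires_human_review: bool


@dataclass(frozen=True)
class OverrideResult:
    success: bool
    audit_logged: bool


class ModelService:
    """Toy service: score_fn maps features to (label, confidence)."""

    def __init__(self, score_fn, review_threshold=0.70):
        self._score_fn = score_fn
        self._review_threshold = review_threshold  # hypothetical value
        self._halted = threading.Event()
        self.audit_log = []

    def predict(self, features):
        if self._halted.is_set():
            raise RuntimeError("service halted by operator override")
        label, confidence = self._score_fn(features)
        # Low-confidence outputs are flagged for a human reviewer.
        return Output(label, confidence, confidence < self._review_threshold)

    def override(self, reason):
        """Stop further predictions and record who-did-what in the audit log."""
        self._halted.set()
        entry = {"event": "override", "reason": reason}
        self.audit_log.append(entry)
        return OverrideResult(
            success=self._halted.is_set(),
            audit_logged=entry in self.audit_log,
        )
```

A real implementation would also need to cancel in-flight work and persist the audit entry outside process memory; the point of the sketch is only that both properties are cheap to assert in a unit test.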
Gate 5: Technical Documentation (Art.11 + Annex IV)
The final gate is often the most operationally intensive: generating and validating your Annex IV technical documentation. Annex IV lists nine mandatory elements, from general system description to design specifications, monitoring plans, and post-market data collection.
Rather than maintaining this documentation manually, treat it as a generated artifact:
# scripts/generate_annex_iv_doc.py
import os
from datetime import datetime, timezone

# extract_architecture_info is a project-specific helper assumed to exist.


def generate_technical_documentation(model_version, config, test_results):
    return {
        "annex_iv_section_1": {
            "general_description": config["system_description"],
            "intended_purpose": config["intended_purpose"],
            "provider_details": config["provider"],
        },
        "annex_iv_section_2": {
            "design_specifications": extract_architecture_info(model_version),
            "training_methodology": config["training_methodology"],
        },
        "annex_iv_section_3": {
            "monitoring_measures": config["monitoring_plan"],
        },
        # Annex IV point 5 covers the risk management system (Art.9);
        # point 4 concerns the appropriateness of performance metrics.
        "annex_iv_section_5": {
            "risk_management_measures": config["risk_management"],
        },
        "test_evidence": {
            "accuracy": test_results["art15"],
            "bias_report": test_results["art10"],
            "logging_verification": test_results["art12"],
            "oversight_test": test_results["art14"],
        },
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "model_commit": os.getenv("GITHUB_SHA"),
    }
The generated documentation artifact is stored in your EU-compliant artifact registry (ideally on EU-hosted infrastructure — this is where hosting choice becomes a compliance choice). For sota.io deployments, artifact storage on EU-sovereign infrastructure means your technical documentation trail never crosses into CLOUD Act jurisdiction.
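Because the generator above fills only some of the nine Annex IV points, a completeness check before upload can report which ones are still missing. The sketch below assumes the `annex_iv_section_N` key naming used by the generator; the short labels are paraphrases of the Annex IV headings, not quotations.

```python
# Paraphrased labels for the nine Annex IV points (not the legal wording).
ANNEX_IV_POINTS = {
    1: "general description of the AI system",
    2: "elements of the system and its development process",
    3: "monitoring, functioning and control",
    4: "appropriateness of performance metrics",
    5: "risk management system",
    6: "relevant changes over the lifecycle",
    7: "harmonised standards applied",
    8: "EU declaration of conformity",
    9: "post-market monitoring plan",
}


def missing_annex_iv_sections(doc):
    """Return keys of Annex IV sections that are absent or empty in doc."""
    missing = []
    for number in sorted(ANNEX_IV_POINTS):
        key = f"annex_iv_section_{number}"
        if not doc.get(key):
            missing.append(key)
    return missing


def describe_missing(doc):
    """Human-readable lines for a CI log or a failure message."""
    return [
        f"{key}: {ANNEX_IV_POINTS[int(key.rsplit('_', 1)[1])]}"
        for key in missing_annex_iv_sections(doc)
    ]
```

Whether a missing section should fail the build or only warn is a policy decision; some points (such as the declaration of conformity) are produced outside the pipeline and may be attached by reference.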
Integrating the Five Gates: A Sample GitHub Actions Workflow
# .github/workflows/eu-ai-act-compliance.yml
name: EU AI Act Compliance Pipeline

on:
  push:
    branches: [main]
  pull_request:
    paths:
      - 'models/**'
      - 'training/**'

jobs:
  compliance-gates:
    runs-on: ubuntu-latest
    environment: eu-compliance
    steps:
      - uses: actions/checkout@v4

      - name: Gate 1 - Data Governance (Art.10)
        id: gate_data
        run: python scripts/check_dataset_provenance.py && python scripts/run_bias_report.py

      - name: Gate 2 - Accuracy & Robustness (Art.15)
        id: gate_accuracy
        run: python scripts/run_accuracy_gate.py

      - name: Gate 3 - Logging Configuration (Art.12)
        id: gate_logging
        run: python scripts/verify_logging_hooks.py

      - name: Gate 4 - Human Oversight (Art.14)
        id: gate_oversight
        run: pytest tests/test_human_oversight.py -v

      - name: Gate 5 - Technical Documentation (Art.11)
        id: gate_docs
        run: python scripts/generate_annex_iv_doc.py --output artifacts/annex-iv-${{ github.sha }}.json

      - name: Upload Compliance Artifacts
        if: always()  # keep whatever evidence exists, even when a gate failed
        uses: actions/upload-artifact@v4
        with:
          name: eu-ai-act-compliance-${{ github.sha }}
          path: |
            reports/art10-bias-*.html
            artifacts/annex-iv-*.json
            logs/compliance-evidence-*.json
          retention-days: 365  # Match post-market monitoring retention

      - name: Compliance Summary
        # Runs only when every gate step above succeeded, so each row reports PASS
        run: |
          echo "## EU AI Act Compliance Report" >> "$GITHUB_STEP_SUMMARY"
          echo "| Gate | Article | Status |" >> "$GITHUB_STEP_SUMMARY"
          echo "|------|---------|--------|" >> "$GITHUB_STEP_SUMMARY"
          echo "| Data Governance | Art.10 | ✅ PASS |" >> "$GITHUB_STEP_SUMMARY"
          echo "| Accuracy/Robustness | Art.15 | ✅ PASS |" >> "$GITHUB_STEP_SUMMARY"
          echo "| Logging Config | Art.12 | ✅ PASS |" >> "$GITHUB_STEP_SUMMARY"
          echo "| Human Oversight | Art.14 | ✅ PASS |" >> "$GITHUB_STEP_SUMMARY"
          echo "| Technical Docs | Art.11 | ✅ PASS |" >> "$GITHUB_STEP_SUMMARY"
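The summary step in the workflow writes fixed PASS rows, which is only accurate because the step is skipped when an earlier gate fails. A variant that builds the table from actual gate results would also cover failed runs. The sketch below is one way to do that; the shape of the `results` list is an assumption, while `GITHUB_STEP_SUMMARY` is the environment variable GitHub Actions provides for job summaries.

```python
import os


def render_summary(results):
    """Render gate results as the Markdown table used in the job summary.

    results: list of dicts with 'gate', 'article' and 'passed' keys
    (a hypothetical shape; adapt it to your evidence records).
    """
    lines = [
        "## EU AI Act Compliance Report",
        "",
        "| Gate | Article | Status |",
        "|------|---------|--------|",
    ]
    for result in results:
        status = "✅ PASS" if result["passed"] else "❌ FAIL"
        lines.append(f"| {result['gate']} | {result['article']} | {status} |")
    return "\n".join(lines) + "\n"


def append_to_step_summary(markdown, env=None):
    """Append to the GitHub job summary; return False when not running in Actions."""
    env = os.environ if env is None else env
    path = env.get("GITHUB_STEP_SUMMARY")
    if not path:
        return False
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(markdown)
    return True
```

Run from a step marked `if: always()`, a script like this would report the failing gate in the summary instead of leaving the summary empty.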
Artifact Storage and EU Sovereignty
Compliance evidence generated in CI/CD is not just convenient documentation — it is potentially the primary evidence you present to a National Competent Authority (NCA) during a market surveillance inspection. Under Art.74, NCAs have the power to request access to technical documentation. If your artifacts are stored in a US-headquartered cloud (AWS S3, Google Cloud Storage, Azure Blob Storage), that storage is potentially subject to the CLOUD Act — meaning US authorities, not just EU regulators, could demand access to your compliance records under US law.
For EU-deployed AI systems, the recommendation is:
- Store compliance artifacts in EU-sovereign object storage (Hetzner Object Storage, Scaleway Object Storage, OVHcloud Object Storage)
- For CI/CD artifact storage, configure your runners to push to EU-hosted registries
- If using GitHub Actions hosted runners, remember that actions/upload-artifact stores files in GitHub's own storage; add a step that pushes the evidence package to an EU-hosted bucket, or use self-hosted runners on EU infrastructure
The Pre-Deployment Compliance Report
Before every production deployment, your pipeline should generate a single compliance report that consolidates all gate outputs:
EU AI Act Pre-Deployment Compliance Report
==========================================
Model Version: 2.4.1
Commit:        abc123def456
Pipeline Run:  2026-06-01T14:22:00Z
Environment:   production-eu-west

Gate Results:
  Art.10 Data Governance: PASS (bias gap: 0.023 < threshold 0.05)
  Art.15 Accuracy:        PASS (accuracy: 0.943 >= threshold 0.92)
  Art.15 Robustness:      PASS (robustness: 0.891 >= threshold 0.88)
  Art.12 Logging Config:  PASS (schema validated, retention: 365d)
  Art.14 Human Oversight: PASS (override latency: 12ms, audit log: ✓)
  Art.11 Documentation:   PASS (annex-iv-abc123def456.json generated)

Compliance Verdict: APPROVED FOR DEPLOYMENT
Evidence Package:   s3://eu-compliance-artifacts/2026-06-01/abc123def456/
Deploying to:       sota.io (EU-sovereign PaaS, Hetzner Germany)
CLOUD Act exposure: None (EU-only infrastructure)
This report becomes the cover page of your Art.11 technical documentation package for this deployment.
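The report points at an evidence package directory. One way to make that package verifiable after the fact is to ship a manifest of SHA-256 digests alongside it, sketched below. The manifest layout and function names are assumptions for illustration, not a required format.

```python
import hashlib
from pathlib import Path


def build_manifest(evidence_dir):
    """List every file under evidence_dir with its size and SHA-256 digest."""
    root = Path(evidence_dir)
    entries = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        entries.append({
            "path": path.relative_to(root).as_posix(),
            "bytes": path.stat().st_size,
            "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
        })
    return {"files": entries}


def verify_manifest(evidence_dir, manifest):
    """Return the relative paths that are missing or whose content changed."""
    root = Path(evidence_dir)
    changed = []
    for entry in manifest["files"]:
        path = root / entry["path"]
        if not path.is_file():
            changed.append(entry["path"])
        elif hashlib.sha256(path.read_bytes()).hexdigest() != entry["sha256"]:
            changed.append(entry["path"])
    return changed
```

Storing the manifest's own digest in the deployment record (or in the append-only log store) gives a single value that an auditor can use to confirm the package is the one the pipeline produced.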
22-Item CI/CD Compliance Checklist
Data Governance (Art.10)
- Dataset provenance recorded in version-controlled registry
- Bias detection runs on every model training run
- Demographic parity and equalized odds computed for protected attributes
- Representativeness verified for EU deployment scope
- Data quality validation (completeness, schema, outliers) in pipeline
Accuracy and Robustness (Art.15)
- Accuracy threshold defined, documented in Annex IV, enforced in pipeline
- Held-out test set is isolated from training (verified by a data leakage check)
- Distribution shift test (performance on time-shifted or domain-shifted data)
- For cybersecurity-sensitive domains: adversarial input testing included
- Performance thresholds reviewed with each model architecture change
Logging and Record-Keeping (Art.12)
- Logging hooks verified at build time, not only in production config
- Log schema includes: timestamp, input hash, output, confidence, user ref
- Logs route to append-only storage (immutability requirement)
- Log retention set to at least the post-market monitoring period (≥ 12 months in this pipeline)
Human Oversight (Art.14)
- Override mechanism test included in CI
- Low-confidence output flagging tested (threshold documented in the Art.9 risk management system)
- Override actions logged to audit trail automatically
- Override test runs under simulated high-load conditions
Technical Documentation (Art.11 + Annex IV)
- Annex IV document generated as a pipeline artifact
- Artifact includes link to all gate evidence (bias report, accuracy log, etc.)
- Documentation artifact stored in EU-sovereign storage
- Documentation version aligns with model version (immutable after deployment)
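Some checklist items are simple to automate. As one example, the train/test isolation item under Art.15 can be approximated by fingerprinting rows and looking for overlaps. This sketch catches only exact duplicates after light normalisation (trimmed whitespace, lowercased text); near-duplicates and label leakage through derived features need other techniques. All names here are illustrative.

```python
import hashlib


def row_fingerprint(row):
    """Stable hash of a row after trimming whitespace and lowercasing."""
    # The unit separator avoids accidental collisions between adjacent fields.
    normalised = "\x1f".join(str(value).strip().lower() for value in row)
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def find_leaked_rows(train_rows, test_rows):
    """Return indices of test rows that also appear in the training set."""
    train_hashes = {row_fingerprint(row) for row in train_rows}
    return [
        index
        for index, row in enumerate(test_rows)
        if row_fingerprint(row) in train_hashes
    ]
```

A gate would fail when the returned list is non-empty and record the offending indices in the evidence file.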
What Comes Next in This Series
This post covers the pipeline architecture. The remaining four posts in the series go deeper:
- Post 2/5: Bias testing methodology — what demographic parity actually means for your specific use case, and how to set defensible thresholds before an NCA inspection
- Post 3/5: Runtime monitoring and post-market surveillance (Art.72) — the compliance obligations that start AFTER deployment
- Post 4/5: Generating audit-ready technical documentation from CI artifacts — what the Annex IV document must contain and how to automate it
- Post 5/5: The complete compliance testing checklist — combining all five gates into a pre-certification readiness review for notified body submission
The August 2, 2026 deadline is 61 days away. If your high-risk AI system does not have automated compliance gates in CI/CD, you have two months to build them. That is enough time — if you start now.
sota.io is an EU-native managed PaaS built for teams that need to keep their AI compliance artifacts in EU-sovereign infrastructure. No US parent, no CLOUD Act exposure, no data leaving Hetzner Germany. Your compliance evidence stays where your regulator expects it.