2026-06-03·5 min read·sota.io Team

EU AI Act Art.10 Data Governance CI/CD Gates: Automated Training Data Compliance Checks for High-Risk AI Pipelines

Post #4 in the sota.io EU AI Act Data Governance Sprint — August 2026 Deadline


The EU AI Act does not mention CI/CD pipelines anywhere in its 113 articles. But for engineering teams building high-risk AI systems, the Art.10 data governance obligations translate naturally into automated checks that run before any model version is approved for deployment. If you wait until an audit to discover that your training dataset lacks provenance documentation or was never examined for bias, you are already out of compliance.

This guide shows how to build Art.10-compliant CI/CD data governance gates: automated pipeline checks that enforce documentation, testing, and logging requirements as blocking conditions before your model reaches production.

What Art.10 Actually Requires (The Short Version)

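Art.10(1) of the EU AI Act requires high-risk AI systems to be developed on training, validation, and testing datasets that meet the quality criteria in paragraphs 2 to 5. Art.10(2) makes those datasets subject to appropriate data governance and management practices and enumerates what the practices must cover: (a) relevant design choices; (b) data collection processes and the origin of the data, including the original collection purpose for personal data; (c) data preparation operations such as annotation, labelling, cleaning, updating, enrichment, and aggregation; (d) the formulation of assumptions; (e) an assessment of the availability, quantity, and suitability of the datasets needed; (f) examination for possible biases; (g) measures to detect, prevent, and mitigate the biases identified; and (h) identification of data gaps or shortcomings and how they can be addressed.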

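Art.10(3) adds that datasets must be relevant, sufficiently representative, and, to the best extent possible, free of errors and complete in view of the intended purpose. It also requires appropriate statistical properties, including as regards the persons or groups the system is intended to be used on.

Art.10(4) requires datasets to take into account the geographical, contextual, behavioural, or functional setting in which the system is intended to be used.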

Art.10(5) permits processing of special categories of personal data for bias detection and correction, but only to the extent strictly necessary and subject to the safeguards that paragraph lists.

In a CI/CD context, each of these requirements maps to one or more automated checks. The goal is that no training dataset enters a production pipeline without satisfying every gate.
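
The "every gate must pass" rule can be sketched as a small fail-fast runner. This is a minimal illustration, not part of any library: `run_gates` and the result keys are hypothetical names, and each check is assumed to be a callable that returns a list of error strings, which is the convention the validators later in this post follow.

```python
from typing import Callable

# A gate is a (gate_id, check) pair; the check returns a list of error strings.
Gate = tuple[str, Callable[[], list[str]]]


def run_gates(gates: list[Gate]) -> dict:
    """Run gates in order and stop at the first one that reports errors."""
    passed: list[str] = []
    for gate_id, check in gates:
        errors = check()
        if errors:
            return {
                "approved": False,
                "passed": passed,
                "failed_gate": gate_id,
                "errors": errors,
            }
        passed.append(gate_id)
    return {"approved": True, "passed": passed, "failed_gate": None, "errors": []}


if __name__ == "__main__":
    # Stand-in checks; real ones would wrap the validators for each gate.
    result = run_gates([
        ("provenance", lambda: []),
        ("preparation_log", lambda: ["EMPTY operations log"]),
        ("bias", lambda: []),
    ])
    print(result)
```

Stopping at the first failure keeps cheap documentation checks from being masked by slower downstream ones, at the cost of reporting only one failing gate per run.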

The Five Art.10 CI/CD Gate Categories

Gate 1 — Provenance Documentation Gate (Art.10(2)(b))

Every training dataset must have a traceable origin record. This gate checks that the dataset manifest includes required provenance fields before the pipeline proceeds.

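What the gate checks: that the manifest's provenance block contains a non-empty source_id, collection_date_start, collection_date_end, data_origin_description, and license_identifier, and, when contains_personal_data is true, also original_collection_purpose, data_controller, and legal_basis.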

Implementation — Python validation script:

import json
import sys
from pathlib import Path

REQUIRED_PROVENANCE_FIELDS = [
    "source_id",
    "collection_date_start",
    "collection_date_end",
    "data_origin_description",
    "license_identifier",
]

PERSONAL_DATA_EXTRA_FIELDS = [
    "original_collection_purpose",
    "data_controller",
    "legal_basis",
]

def validate_provenance(manifest_path: str) -> list[str]:
    with open(manifest_path) as f:
        manifest = json.load(f)

    errors = []
    provenance = manifest.get("provenance", {})

    for field in REQUIRED_PROVENANCE_FIELDS:
        if not provenance.get(field):
            errors.append(f"MISSING provenance.{field} (Art.10(2)(b))")

    if manifest.get("contains_personal_data", False):
        for field in PERSONAL_DATA_EXTRA_FIELDS:
            if not provenance.get(field):
                errors.append(
                    f"MISSING provenance.{field} required for personal data (Art.10(2)(b))"
                )

    return errors


if __name__ == "__main__":
    errors = validate_provenance(sys.argv[1])
    for e in errors:
        print(f"[FAIL] {e}")
    sys.exit(1 if errors else 0)

GitHub Actions gate:

- name: Art.10(2)(b) Provenance Gate
  run: |
    python scripts/compliance/validate_provenance.py datasets/${{ matrix.dataset }}/manifest.json

This gate must pass before any dataset is admitted into the training pipeline.
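
For orientation, a manifest that would satisfy this gate might look like the sketch below. Only the field names come from the validator above; every value, the `write_manifest` helper, and the file location are hypothetical placeholders.

```python
import json

# Hypothetical example values; only the field names are taken from the validator.
sample_manifest = {
    "dataset_id": "loan-applications-2024",
    "contains_personal_data": True,
    "provenance": {
        "source_id": "crm-export-eu-west",
        "collection_date_start": "2023-01-01",
        "collection_date_end": "2023-12-31",
        "data_origin_description": "Loan applications submitted via the web portal",
        "license_identifier": "proprietary-internal",
        # The next three fields are required only because
        # contains_personal_data is true.
        "original_collection_purpose": "Processing of loan applications",
        "data_controller": "Example Bank AG",
        "legal_basis": "placeholder: reference to your documented legal basis",
    },
}


def write_manifest(path: str, manifest: dict) -> None:
    """Serialize the manifest as indented JSON so diffs stay reviewable."""
    with open(path, "w") as f:
        json.dump(manifest, f, indent=2)


if __name__ == "__main__":
    write_manifest("manifest.json", sample_manifest)
```

Keeping the manifest in the same repository path as the dataset means a change to either one shows up in the same pull request.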

Gate 2 — Data Preparation Documentation Gate (Art.10(2)(c))

Art.10(2)(c) requires that relevant data preparation operations be documented, such as annotation, labelling, cleaning, updating, enrichment, and aggregation. This gate checks that the operation log exists and is complete.

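What the gate checks: that the operations log is non-empty and that every entry has a recognized type, an operator identity, a timestamp, and parameters_documented set to true.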

import json

VALID_OPERATION_TYPES = {
    "annotation",
    "labelling",
    "cleaning",
    "updating",
    "enrichment",
    "aggregation",
    "filtering",
    "normalization",
    "augmentation",
}

def validate_preparation_log(ops_log_path: str) -> list[str]:
    with open(ops_log_path) as f:
        log = json.load(f)

    errors = []
    operations = log.get("operations", [])

    if not operations:
        errors.append("EMPTY operations log (Art.10(2)(c))")
        return errors

    for i, op in enumerate(operations):
        if op.get("type") not in VALID_OPERATION_TYPES:
            errors.append(f"Op[{i}]: unknown type '{op.get('type')}' (Art.10(2)(c))")
        if not op.get("operator"):
            errors.append(f"Op[{i}]: missing operator identity (Art.10(2)(c))")
        if not op.get("timestamp"):
            errors.append(f"Op[{i}]: missing timestamp (Art.10(2)(c))")
        if not op.get("parameters_documented"):
            errors.append(f"Op[{i}]: parameters_documented must be true (Art.10(2)(c))")

    return errors

If your pipeline uses DVC or MLflow, you can auto-generate this log from pipeline run artifacts and validate it in the gate.
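
Independent of any particular tool, the log can also be written by the preparation steps themselves. The sketch below deliberately does not use DVC or MLflow APIs; `record_operation` is a hypothetical helper, and the `parameters` field is an addition beyond what the validator reads.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def record_operation(log_path: str, op_type: str, operator: str, parameters: dict) -> dict:
    """Append one preparation step to the operations log and return the entry."""
    path = Path(log_path)
    log = json.loads(path.read_text()) if path.exists() else {"operations": []}
    entry = {
        "type": op_type,
        "operator": operator,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        # Storing the parameters themselves keeps the boolean flag below honest.
        "parameters": parameters,
        "parameters_documented": bool(parameters),
    }
    log["operations"].append(entry)
    path.write_text(json.dumps(log, indent=2))
    return entry


if __name__ == "__main__":
    record_operation(
        "operations_log.json",
        "cleaning",
        "ci-pipeline",
        {"drop_duplicates": True, "min_text_length": 20},
    )
```

Calling the helper with an empty parameter dict produces an entry with parameters_documented set to false, which the gate would then reject.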

Gate 3 — Bias and Representativeness Gate (Art.10(2)(f) and Art.10(3))

Art.10(2)(f) requires examination of datasets for possible biases that could affect health, safety, or fundamental rights. Art.10(3) requires that datasets be sufficiently representative. This is the most technically complex gate.

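What the gate checks: that the bias report covers every dimension listed in bias_dimensions_required, includes a representativeness_assessment, and sets bias_findings_documented to true.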

import json


def validate_bias_report(report_path: str, use_case_config: dict) -> list[str]:
    with open(report_path) as f:
        report = json.load(f)

    errors = []
    required_dimensions = use_case_config.get("bias_dimensions_required", [])

    # Use .get so a malformed result entry cannot crash the gate
    tested_dimensions = {r.get("dimension") for r in report.get("results", [])}
    missing = set(required_dimensions) - tested_dimensions
    for dim in missing:
        errors.append(f"MISSING bias test for dimension '{dim}' (Art.10(2)(f))")

    if not report.get("representativeness_assessment"):
        errors.append("MISSING representativeness_assessment (Art.10(3))")

    if not report.get("bias_findings_documented"):
        errors.append(
            "bias_findings_documented must be true — even if no biases found, "
            "document that explicitly (Art.10(2)(f))"
        )

    return errors

Use-case configuration example (use_case.json):

{
  "use_case_id": "credit-scoring-v3",
  "high_risk_category": "access-to-essential-services",
  "bias_dimensions_required": [
    "gender",
    "age_group",
    "national_origin",
    "disability_status"
  ],
  "min_representativeness_score": 0.75
}

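The bias_dimensions_required list should be derived from the risk categories identified by your Art.9 Risk Management System. If your RMS identifies age as a risk factor for your use case, then the bias gate must test for age-based disparities. Note that validate_bias_report as written only checks that the report exists and covers the required dimensions; it does not read min_representativeness_score, so enforcing that threshold needs an additional comparison against whatever score your representativeness assessment records.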
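
How the entries in a bias report get produced is left open above. One simple input is a comparison of group shares in the dataset against a reference population, sketched below. The function names, the `share_gaps` and `within_tolerance` keys, and the 0.05 tolerance are all assumptions for illustration; only the `dimension` key is read by the validator. This measures representation only, not outcome disparities in a trained model.

```python
def representation_gaps(
    observed_counts: dict[str, int],
    reference_shares: dict[str, float],
) -> dict[str, float]:
    """Return observed share minus reference share for every reference group."""
    total = sum(observed_counts.values())
    gaps = {}
    for group, expected in reference_shares.items():
        observed = observed_counts.get(group, 0) / total if total else 0.0
        gaps[group] = round(observed - expected, 4)
    return gaps


def bias_result(
    dimension: str,
    observed_counts: dict[str, int],
    reference_shares: dict[str, float],
    tolerance: float = 0.05,  # hypothetical threshold, set per use case
) -> dict:
    """Build one report entry for a single bias dimension."""
    gaps = representation_gaps(observed_counts, reference_shares)
    return {
        "dimension": dimension,
        "share_gaps": gaps,
        "within_tolerance": all(abs(g) <= tolerance for g in gaps.values()),
    }


if __name__ == "__main__":
    print(bias_result(
        "gender",
        {"female": 300, "male": 700},
        {"female": 0.5, "male": 0.5},
    ))
```

A group that is absent from the dataset shows up with a negative gap equal to its full reference share, which makes missing populations visible instead of silently ignored.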

Gate 4 — Data Gap and Assumptions Gate (Art.10(2)(d) and Art.10(2)(h))

Art.10(2)(d) requires documentation of the assumptions made about the data, in particular about what the data are supposed to measure and represent. Art.10(2)(h) requires identification of relevant data gaps and shortcomings, along with how they can be addressed.

Why this gate exists: Teams often know about dataset limitations — imbalanced classes, underrepresented regions, outdated samples — but document nothing. Art.10 makes this documentation mandatory.

import json


def validate_assumptions_and_gaps(manifest_path: str) -> list[str]:
    with open(manifest_path) as f:
        manifest = json.load(f)

    errors = []

    assumptions = manifest.get("assumptions", [])
    if not assumptions:
        errors.append(
            "MISSING assumptions list: document each assumption, or add an "
            "entry stating that none are known (Art.10(2)(d))"
        )

    gaps = manifest.get("data_gaps", {})
    if "identified_gaps" not in gaps:
        errors.append("MISSING data_gaps.identified_gaps (Art.10(2)(h))")
    if "mitigation_measures" not in gaps:
        errors.append("MISSING data_gaps.mitigation_measures (Art.10(2)(h))")

    # Validate that each identified gap has a mitigation entry
    identified = gaps.get("identified_gaps", [])
    mitigations = {m.get("gap_id") for m in gaps.get("mitigation_measures", [])}
    for gap in identified:
        gap_id = gap.get("gap_id")
        # A gap without an ID can never be matched to a mitigation
        if gap_id is None or gap_id not in mitigations:
            errors.append(
                f"Gap '{gap_id}' has no mitigation entry (Art.10(2)(h))"
            )

    return errors

Gate 5 — Dataset Suitability and Statistical Properties Gate (Art.10(2)(e) and Art.10(3))

Art.10(2)(e) requires assessment of the availability, quantity, and suitability of datasets. Art.10(3) requires appropriate statistical properties.

This gate runs automated statistical checks against a documented minimum threshold configuration:

import json

def validate_statistical_properties(
    stats_path: str,
    thresholds_path: str,
) -> list[str]:
    with open(stats_path) as f:
        stats = json.load(f)
    with open(thresholds_path) as f:
        thresholds = json.load(f)

    errors = []

    # Check minimum sample size
    actual_size = stats.get("total_samples", 0)
    min_size = thresholds.get("min_samples_required", 0)
    if actual_size < min_size:
        errors.append(
            f"Dataset size {actual_size} < required minimum {min_size} "
            f"(Art.10(2)(e): insufficient quantity)"
        )

    # Check class balance
    class_dist = stats.get("class_distribution", {})
    min_class_ratio = thresholds.get("min_class_ratio", 0.05)
    total = sum(class_dist.values())
    for cls, count in class_dist.items():
        ratio = count / total if total > 0 else 0
        if ratio < min_class_ratio:
            errors.append(
                f"Class '{cls}' ratio {ratio:.3f} below threshold {min_class_ratio} "
                f"(Art.10(3): statistical properties)"
            )

    # Check missing value rate
    missing_rate = stats.get("missing_value_rate", 0)
    max_missing = thresholds.get("max_missing_value_rate", 0.05)
    if missing_rate > max_missing:
        errors.append(
            f"Missing value rate {missing_rate:.3f} exceeds maximum {max_missing} "
            f"(Art.10(3): free of errors)"
        )

    return errors
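
The gate reads a precomputed stats.json, so something upstream has to produce it. A minimal sketch of that step follows, assuming the dataset fits in memory as a list of dicts and that missing values are represented as None; `compute_dataset_stats` and both assumptions are illustrative, not part of the gate itself.

```python
import json
from collections import Counter


def compute_dataset_stats(records: list[dict], label_field: str) -> dict:
    """Summarize records into the three fields the statistical gate reads."""
    total_cells = sum(len(r) for r in records)
    # Assumption: a missing value is stored as None.
    missing_cells = sum(1 for r in records for v in r.values() if v is None)
    # Records without a label are left out of the class distribution.
    labels = Counter(
        str(r[label_field]) for r in records if r.get(label_field) is not None
    )
    return {
        "total_samples": len(records),
        "class_distribution": dict(labels),
        "missing_value_rate": missing_cells / total_cells if total_cells else 0.0,
    }


if __name__ == "__main__":
    rows = [
        {"income": 42000, "label": "approved"},
        {"income": None, "label": "approved"},
        {"income": 18000, "label": "rejected"},
    ]
    print(json.dumps(compute_dataset_stats(rows, "label"), indent=2))
```

Generating the stats file in the same pipeline run that produces the dataset keeps the numbers the gate checks tied to the data that is actually trained on.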

Full CI/CD Pipeline Integration

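Here is a complete GitHub Actions workflow that runs all five Art.10 gates sequentially. Steps are ordered by cost, with fast JSON checks first and the expensive bias check last, so the gate numbers in the workflow follow execution order rather than the section numbering above: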

name: EU AI Act Art.10 Data Governance Gates

on:
  pull_request:
    paths:
      - 'datasets/**'
      - 'training_configs/**'

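jobs:
  # Assumption: each dataset lives in its own directory under datasets/.
  # This job lists those directories so the matrix below has something to expand.
  discover-datasets:
    runs-on: ubuntu-latest
    outputs:
      datasets: ${{ steps.list.outputs.datasets }}
    steps:
      - uses: actions/checkout@v4
      - id: list
        run: |
          echo "datasets=$(ls -d datasets/*/ | xargs -n1 basename | jq -R . | jq -sc .)" >> "$GITHUB_OUTPUT"

  art10-compliance:
    needs: discover-datasets
    runs-on: ubuntu-latest
    strategy:
      matrix:
        dataset: ${{ fromJson(needs.discover-datasets.outputs.datasets) }}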

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'

      - name: Install compliance tools
        run: pip install -r requirements-compliance.txt

      # Gate 1: Provenance (fast — JSON validation only)
      - name: Gate 1 — Art.10(2)(b) Provenance Documentation
        run: |
          python scripts/compliance/validate_provenance.py \
            datasets/${{ matrix.dataset }}/manifest.json
        # Fail fast — no point running bias tests on undocumented data

      # Gate 2: Preparation log (fast — JSON validation only)
      - name: Gate 2 — Art.10(2)(c) Data Preparation Log
        run: |
          python scripts/compliance/validate_preparation_log.py \
            datasets/${{ matrix.dataset }}/operations_log.json

      # Gate 3: Assumptions and gaps (fast — JSON validation only)
      - name: Gate 3 — Art.10(2)(d)/(h) Assumptions and Data Gaps
        run: |
          python scripts/compliance/validate_assumptions_gaps.py \
            datasets/${{ matrix.dataset }}/manifest.json

      # Gate 4: Statistical properties (medium — reads dataset stats)
      - name: Gate 4 — Art.10(2)(e) and Art.10(3) Statistical Properties
        run: |
          python scripts/compliance/validate_statistics.py \
            datasets/${{ matrix.dataset }}/stats.json \
            use_cases/${{ matrix.dataset }}/thresholds.json

      # Gate 5: Bias testing (slow — may run ML fairness checks)
      - name: Gate 5 — Art.10(2)(f)/(3) Bias and Representativeness
        run: |
          python scripts/compliance/validate_bias_report.py \
            datasets/${{ matrix.dataset }}/bias_report.json \
            use_cases/${{ matrix.dataset }}/use_case.json

      # Compliance certificate: generate Art.10 documentation artifact
      - name: Generate Art.10 Compliance Certificate
        if: success()
        run: |
          python scripts/compliance/generate_art10_certificate.py \
            datasets/${{ matrix.dataset }}/ \
            --output artifacts/art10-certificates/${{ matrix.dataset }}-$(date +%Y%m%d).json

      - name: Upload compliance artifacts
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: art10-compliance-${{ matrix.dataset }}
          path: artifacts/art10-certificates/
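          retention-days: 90  # GitHub caps artifact retention (90 days by default); archive elsewhere for the long term

GitHub Actions artifacts are not a 10-year archive: retention is capped by repository and organization settings, so the certificates also need to be copied to durable storage. The 10-year horizon is not arbitrary. Under Art.18, providers must keep the technical documentation at the disposal of national competent authorities for 10 years after the high-risk AI system is placed on the market or put into service.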

The Art.10 Compliance Certificate

When all five gates pass, generate a machine-readable certificate that becomes part of your Annex IV technical documentation package:

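import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def load_json(path: str) -> dict:
    with open(path) as f:
        return json.load(f)


def compute_sha256(*documents: dict) -> str:
    # Assumed scheme: SHA-256 over a canonical, key-sorted JSON serialization
    canonical = json.dumps(documents, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def generate_art10_certificate(dataset_dir: str, output_path: str) -> None:
    manifest = load_json(f"{dataset_dir}/manifest.json")
    ops_log = load_json(f"{dataset_dir}/operations_log.json")
    bias_report = load_json(f"{dataset_dir}/bias_report.json")

    # Only call this after every gate has passed; the flags below record that outcome
    certificate = {
        "certificate_type": "EU_AI_ACT_ART10_DATA_GOVERNANCE",
        "regulation_version": "EU 2024/1689",
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "dataset_id": manifest["dataset_id"],
        "gates_passed": {
            "art10_2b_provenance": True,
            "art10_2c_preparation_log": True,
            "art10_2d_assumptions": True,
            "art10_2e_suitability": True,
            "art10_2f_bias_testing": True,
            "art10_2h_gaps_documented": True,
            "art10_3_representativeness": True,
            "art10_3_statistical_properties": True,
        },
        "certificate_hash": compute_sha256(manifest, ops_log, bias_report),
        "valid_for_training_run": True,
    }

    # Create the output directory if the pipeline has not done so already
    Path(output_path).parent.mkdir(parents=True, exist_ok=True)
    with open(output_path, "w") as f:
        json.dump(certificate, f, indent=2)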
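
The certificate_hash field is only useful if someone can recheck it later. The sketch below shows one way to do that, assuming the hash is a SHA-256 over a canonical, key-sorted JSON serialization of the three source documents. That scheme is an assumption for illustration, not something the regulation or this pipeline prescribes, and `canonical_sha256` and `certificate_matches` are hypothetical names.

```python
import hashlib
import json


def canonical_sha256(*documents: dict) -> str:
    """Hash documents so that key order inside each one does not matter."""
    canonical = json.dumps(documents, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def certificate_matches(
    certificate: dict,
    manifest: dict,
    ops_log: dict,
    bias_report: dict,
) -> bool:
    """True if the stored hash still matches the three source documents."""
    expected = canonical_sha256(manifest, ops_log, bias_report)
    return certificate.get("certificate_hash") == expected


if __name__ == "__main__":
    manifest = {"dataset_id": "demo"}
    ops_log = {"operations": []}
    bias_report = {"results": []}
    cert = {"certificate_hash": canonical_sha256(manifest, ops_log, bias_report)}
    print(certificate_matches(cert, manifest, ops_log, bias_report))
```

Running this check at the start of a training job ties the run to the exact documentation that passed the gates; any later edit to the manifest, log, or bias report invalidates the certificate.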

This certificate should be stored with the rest of your Annex IV technical documentation and retained for the same period.

Connecting to Your Art.9 Risk Management System

Art.10 compliance does not exist in isolation. Your Art.9 Risk Management System (RMS) defines which risk categories are material for your use case — and those risk categories must flow directly into your Art.10 bias gate configuration.

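Specifically, the bias_dimensions_required list in each use-case configuration should be derived from the risk categories the RMS identifies as material, and the bias test results should be fed back into the RMS.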

This creates a compliance feedback loop: RMS → Art.10 gate config → bias test results → RMS update.
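
The first arrow in that loop can be automated. The sketch below assumes the RMS can export a risk register as a list of dicts with a `material` flag and a `protected_attributes` list. Both field names are hypothetical, as are the two functions; your RMS will have its own schema.

```python
def derive_bias_dimensions(risk_register: list[dict]) -> list[str]:
    """Collect the protected attributes named by material risks, without duplicates."""
    dimensions: list[str] = []
    for risk in risk_register:
        if not risk.get("material", False):
            continue
        for attribute in risk.get("protected_attributes", []):
            if attribute not in dimensions:
                dimensions.append(attribute)
    return dimensions


def sync_use_case_config(use_case_config: dict, risk_register: list[dict]) -> dict:
    """Return a copy of the config whose bias dimensions mirror the RMS."""
    updated = dict(use_case_config)
    updated["bias_dimensions_required"] = derive_bias_dimensions(risk_register)
    return updated


if __name__ == "__main__":
    register = [
        {"risk_id": "R-01", "material": True, "protected_attributes": ["age_group", "gender"]},
        {"risk_id": "R-02", "material": False, "protected_attributes": ["religion"]},
        {"risk_id": "R-03", "material": True, "protected_attributes": ["gender"]},
    ]
    print(sync_use_case_config({"use_case_id": "credit-scoring-v3"}, register))
```

Generating the configuration from the register, instead of editing it by hand, means a newly identified risk cannot be forgotten in the bias gate.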

What Happens When a Gate Fails

A failing gate is not just a CI check failure — it is a documented compliance finding that must be addressed before the dataset is used for training. When a gate fails:

  1. Block the training run — do not proceed to model training with non-compliant data
  2. Create a compliance finding record — log the failure with timestamp, gate ID, dataset ID, and specific violation
  3. Assign ownership — route the finding to the data owner for resolution
  4. Require sign-off before re-run — after the underlying issue is fixed, a designated compliance reviewer must approve re-running the gate

This mirrors how software security gates work: a critical vulnerability finding blocks deployment until it is resolved and reviewed.
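
Steps 2 to 4 can be captured in a small record format. The sketch below is one possible shape, assuming findings are appended to a JSON Lines file. The field names beyond timestamp, gate ID, dataset ID, and violation, as well as the three function names, are hypothetical.

```python
import json
from datetime import datetime, timezone


def build_finding(gate_id: str, dataset_id: str, violation: str, owner: str) -> dict:
    """Create an open finding routed to the given data owner."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "gate_id": gate_id,
        "dataset_id": dataset_id,
        "violation": violation,
        "owner": owner,
        "status": "open",
        # Filled in by the compliance reviewer before the gate may run again.
        "rerun_approved_by": None,
    }


def append_finding(path: str, finding: dict) -> None:
    """Append the finding as one JSON line so earlier entries are never rewritten."""
    with open(path, "a") as f:
        f.write(json.dumps(finding) + "\n")


def may_rerun(finding: dict) -> bool:
    """A gate may be re-run only after a reviewer has signed off."""
    return bool(finding.get("rerun_approved_by"))


if __name__ == "__main__":
    finding = build_finding(
        "art10_2b_provenance",
        "loan-applications-2024",
        "MISSING provenance.license_identifier",
        "data-platform-team",
    )
    append_finding("compliance_findings.jsonl", finding)
    print(may_rerun(finding))
```

An append-only file keeps the history of failures intact, which matters if the findings themselves are later used as evidence of how gaps were handled.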

Handling Exceptions Without Bypassing Art.10

Real-world data pipelines sometimes need to use datasets that cannot fully satisfy every Art.10 requirement — for example, historical datasets collected before Art.10 documentation requirements existed.

The EU AI Act does not provide a blanket exception for legacy data, but Art.10(2)(h) requires that data gaps and shortcomings be identified, along with how they can be addressed. For legacy datasets, "addressed" can mean:

Your CI/CD gates should support an exception mode that allows a dataset to proceed with documented exceptions, but requires an explicit human approval (approved_by) and an expiry date (expires_at) that has not yet passed:

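import json
from datetime import datetime, timezone


def check_exception_validity(manifest_path: str) -> bool:
    with open(manifest_path) as f:
        manifest = json.load(f)
    exception = manifest.get("compliance_exception")

    if not exception:
        return False

    # Exception requires explicit human approval
    if not exception.get("approved_by"):
        return False

    # Exception must carry a parseable expiry date that has not passed
    try:
        expiry = datetime.fromisoformat(exception.get("expires_at", "1970-01-01"))
    except (TypeError, ValueError):
        return False
    if expiry.tzinfo is None:
        # Treat timestamps without an offset as UTC
        expiry = expiry.replace(tzinfo=timezone.utc)
    if expiry < datetime.now(timezone.utc):
        return False

    return True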

Checklist: Art.10 CI/CD Gates Before August 2026

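For each training dataset used by a high-risk AI system, confirm that all five gates pass: provenance documentation, preparation log, bias and representativeness, assumptions and data gaps, and suitability and statistical properties.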

What Comes Next

This is Post #4 in the Data Governance Sprint. The final post covers the complete Art.10 compliance checklist — a consolidated reference that combines provenance, bias testing, preparation documentation, and CI/CD gates into a single go/no-go framework you can use before the August 2026 deadline.

If you are building high-risk AI systems and want to discuss how to structure Art.10-compliant data pipelines on EU infrastructure, sota.io provides EU-native managed hosting with no CLOUD Act exposure — your compliance artifacts stay under EU jurisdiction from the start.
