2026-06-03·5 min read·sota.io Team

EU AI Act Art.10 Data Governance CI/CD Gates: Automated Training Data Compliance Checks for High-Risk AI Pipelines

Post #4 in the sota.io EU AI Act Data Governance Sprint — August 2026 Deadline

The EU AI Act does not mention CI/CD pipelines anywhere in its 113 articles. But for engineering teams building high-risk AI systems, the Art.10 data governance obligations translate directly into automated checks that should run before any model version is approved for deployment. If you wait until an audit to discover that your training dataset lacks provenance documentation or was never examined for bias, the system has already been operating out of compliance.

This guide shows how to build Art.10-compliant CI/CD data governance gates: automated pipeline checks that enforce documentation, testing, and logging requirements as blocking conditions before your model reaches production.

What Art.10 Actually Requires (The Short Version)

Art.10(1) of the EU AI Act establishes that high-risk AI systems must be trained, validated, and tested on datasets subject to appropriate data governance and management practices. Art.10(2) enumerates what those practices must cover:

(a) Relevant design choices
(b) Data collection processes and the origin of data, including for personal data, the original purpose of collection
(c) Relevant data-preparation operations: annotation, labelling, cleaning, enrichment, aggregation
(d) The formulation of relevant assumptions
(e) Assessment of availability, quantity, and suitability of datasets
(f) Examination in view of possible biases that could affect health, safety, or fundamental rights
(g) Identification of data gaps or shortcomings and how they are addressed

Art.10(3) adds that datasets must be relevant, sufficiently representative, and free of errors to the extent possible.

Art.10(4) requires appropriate statistical properties, including regarding the persons or groups the system is intended to serve.

Art.10(5) permits processing of special categories of personal data for bias monitoring, detection, and correction — but only to the extent strictly necessary.

In a CI/CD context, each of these requirements maps to one or more automated checks. The goal is that no training dataset enters a production pipeline without satisfying every gate.

The Five Art.10 CI/CD Gate Categories

Gate 1 — Provenance Documentation Gate (Art.10(2)(b))

Every training dataset must have a traceable origin record. This gate checks that the dataset manifest includes required provenance fields before the pipeline proceeds.

What the gate checks:

Dataset source identifier (URL, internal registry ID, or data contract reference)
Collection date range
For personal data: original purpose of collection documented
Data controller or data provider identity
License or terms of use

Implementation — Python validation script:

import json
import sys

REQUIRED_PROVENANCE_FIELDS = [
    "source_id",
    "collection_date_start",
    "collection_date_end",
    "data_origin_description",
    "license_identifier",
]

PERSONAL_DATA_EXTRA_FIELDS = [
    "original_collection_purpose",
    "data_controller",
    "legal_basis",
]

def validate_provenance(manifest_path: str) -> list[str]:
    with open(manifest_path) as f:
        manifest = json.load(f)

    errors = []
    provenance = manifest.get("provenance", {})

    for field in REQUIRED_PROVENANCE_FIELDS:
        if not provenance.get(field):
            errors.append(f"MISSING provenance.{field} (Art.10(2)(b))")

    if manifest.get("contains_personal_data", False):
        for field in PERSONAL_DATA_EXTRA_FIELDS:
            if not provenance.get(field):
                errors.append(
                    f"MISSING provenance.{field} required for personal data (Art.10(2)(b))"
                )

    return errors


if __name__ == "__main__":
    errors = validate_provenance(sys.argv[1])
    for e in errors:
        print(f"[FAIL] {e}")
    sys.exit(1 if errors else 0)

GitHub Actions gate:

- name: Art.10(2)(b) Provenance Gate
  run: |
    python scripts/compliance/validate_provenance.py datasets/${{ matrix.dataset }}/manifest.json

This gate must pass before any dataset is admitted into the training pipeline.
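
Presence checks alone will accept a manifest whose collection dates are malformed or reversed. The following is a minimal sketch of an extra sanity check, assuming the two date fields hold ISO 8601 calendar dates (the post does not specify a date format). `check_collection_dates` is a hypothetical helper, not part of the script above.

```python
from datetime import date


def check_collection_dates(provenance: dict) -> list[str]:
    """Return errors for malformed or reversed collection dates.

    Assumes ISO 8601 calendar dates (YYYY-MM-DD); adjust if your
    manifests use a different format.
    """
    errors = []
    parsed = {}
    for field in ("collection_date_start", "collection_date_end"):
        raw = provenance.get(field)
        if not raw:
            continue  # absence is already reported by the presence check
        try:
            parsed[field] = date.fromisoformat(raw)
        except (TypeError, ValueError):
            errors.append(f"INVALID provenance.{field}: {raw!r} is not an ISO date")

    start = parsed.get("collection_date_start")
    end = parsed.get("collection_date_end")
    if start and end and start > end:
        errors.append(
            "INVALID provenance: collection_date_start is after collection_date_end"
        )
    if end and end > date.today():
        errors.append("INVALID provenance: collection_date_end is in the future")
    return errors
```

The returned list can be appended to the errors from the presence check, so the gate still reports every problem in one run.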

Gate 2 — Data Preparation Documentation Gate (Art.10(2)(c))

Art.10(2)(c) requires that all data preparation operations are documented: annotation, labelling, cleaning, enrichment, and aggregation. This gate checks that the operation log exists and is complete.

What the gate checks:

Operations log file present for each dataset
Each operation entry includes: operation type, operator identity (human or tool), timestamp, parameters applied
For annotation: annotator qualification records or tool version
For cleaning: outlier removal criteria documented

import json

VALID_OPERATION_TYPES = {
    "annotation",
    "labelling",
    "cleaning",
    "enrichment",
    "aggregation",
    "filtering",
    "normalization",
    "augmentation",
}

def validate_preparation_log(ops_log_path: str) -> list[str]:
    with open(ops_log_path) as f:
        log = json.load(f)

    errors = []
    operations = log.get("operations", [])

    if not operations:
        errors.append("EMPTY operations log (Art.10(2)(c))")
        return errors

    for i, op in enumerate(operations):
        if op.get("type") not in VALID_OPERATION_TYPES:
            errors.append(f"Op[{i}]: unknown type '{op.get('type')}' (Art.10(2)(c))")
        if not op.get("operator"):
            errors.append(f"Op[{i}]: missing operator identity (Art.10(2)(c))")
        if not op.get("timestamp"):
            errors.append(f"Op[{i}]: missing timestamp (Art.10(2)(c))")
        if not op.get("parameters_documented"):
            errors.append(f"Op[{i}]: parameters_documented must be true (Art.10(2)(c))")

    return errors

If your pipeline uses DVC or MLflow, you can auto-generate this log from pipeline run artifacts and validate it in the gate.
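
The checklist for this gate also names annotator qualification records (or a tool version) and outlier removal criteria, which the generic validator does not inspect. A sketch of type-specific checks could look like the following. The field names `annotator_qualification`, `tool_version`, and `outlier_criteria` are assumptions made for illustration, not a standard schema.

```python
def validate_operation_details(op: dict, index: int) -> list[str]:
    """Type-specific checks for one operations-log entry.

    Field names are illustrative assumptions; map them to your own log schema.
    """
    errors = []
    op_type = op.get("type")

    if op_type in ("annotation", "labelling"):
        # Human annotators need a qualification record; tools need a version.
        if not (op.get("annotator_qualification") or op.get("tool_version")):
            errors.append(
                f"Op[{index}]: {op_type} needs annotator_qualification or tool_version"
            )

    if op_type == "cleaning":
        # Cleaning steps should state which records were removed and why.
        if not op.get("outlier_criteria"):
            errors.append(f"Op[{index}]: cleaning needs outlier_criteria")

    return errors
```

Calling this once per entry inside the validator's loop keeps the generic and type-specific findings in the same error list.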

Gate 3 — Bias and Representativeness Gate (Art.10(2)(f) and Art.10(3))

Art.10(2)(f) requires examination of datasets for possible biases that could affect health, safety, or fundamental rights. Art.10(3) requires that datasets be sufficiently representative. This is the most technically complex gate.

What the gate checks:

Bias test results file present (must be produced by a registered testing tool or script)
Key demographic dimensions tested (where applicable to the use case)
Representativeness assessment present (coverage of intended deployment population)
Any identified biases documented with proposed mitigations

import json

def validate_bias_report(report_path: str, use_case_config: dict) -> list[str]:
    with open(report_path) as f:
        report = json.load(f)

    errors = []
    required_dimensions = use_case_config.get("bias_dimensions_required", [])

    # Skip malformed result entries instead of raising KeyError
    tested_dimensions = {
        r["dimension"] for r in report.get("results", []) if "dimension" in r
    }
    missing = set(required_dimensions) - tested_dimensions
    for dim in sorted(missing):
        errors.append(f"MISSING bias test for dimension '{dim}' (Art.10(2)(f))")

    if not report.get("representativeness_assessment"):
        errors.append("MISSING representativeness_assessment (Art.10(3))")

    if not report.get("bias_findings_documented"):
        errors.append(
            "bias_findings_documented must be true — even if no biases found, "
            "document that explicitly (Art.10(2)(f))"
        )

    return errors

Use-case configuration example (use_case.json):

{
  "use_case_id": "credit-scoring-v3",
  "high_risk_category": "access-to-essential-services",
  "bias_dimensions_required": [
    "gender",
    "age_group",
    "national_origin",
    "disability_status"
  ],
  "min_representativeness_score": 0.75
}

The bias_dimensions_required list should be derived from your Art.9 Risk Management System's identified risk categories. If your RMS identifies age as a risk factor for your use case, then the bias gate must test for age-based disparities.
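
The configuration above sets `min_representativeness_score`, but the validator shown earlier never compares a score against it. One possible way to produce such a score is sketched below: 1 minus the total variation distance between the dataset's group shares and a reference population. This metric is an assumption chosen for illustration, since Art.10 does not prescribe one, and `representativeness_score` is a hypothetical helper.

```python
def representativeness_score(
    dataset_counts: dict[str, int],
    reference_shares: dict[str, float],
) -> float:
    """Return 1 - total variation distance between two group distributions.

    1.0 means the dataset's group shares match the reference population
    exactly; 0.0 means the two distributions do not overlap at all.
    """
    total = sum(dataset_counts.values())
    if total == 0:
        return 0.0
    ref_total = sum(reference_shares.values())
    if ref_total <= 0:
        raise ValueError("reference_shares must sum to a positive value")

    groups = set(dataset_counts) | set(reference_shares)
    tvd = 0.5 * sum(
        abs(dataset_counts.get(g, 0) / total - reference_shares.get(g, 0.0) / ref_total)
        for g in groups
    )
    return 1.0 - tvd


# Invented example numbers: the dataset over-represents the youngest group.
score = representativeness_score(
    {"18-30": 600, "31-50": 300, "51+": 100},
    {"18-30": 0.3, "31-50": 0.4, "51+": 0.3},
)
passes = score >= 0.75  # threshold taken from use_case.json
```

With these invented numbers the score is about 0.70, so a gate enforcing the 0.75 threshold would fail the dataset.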

Gate 4 — Data Gap and Assumptions Gate (Art.10(2)(d) and Art.10(2)(g))

Art.10(2)(d) requires documentation of the assumptions made about the data. Art.10(2)(g) requires identification of data gaps and shortcomings, along with how they are addressed.

Why this gate exists: Teams often know about dataset limitations — imbalanced classes, underrepresented regions, outdated samples — but document nothing. Art.10 makes this documentation mandatory.

import json

def validate_assumptions_and_gaps(manifest_path: str) -> list[str]:
    with open(manifest_path) as f:
        manifest = json.load(f)

    errors = []

    assumptions = manifest.get("assumptions", [])
    if not assumptions:
        errors.append(
            "MISSING assumptions list — must document at least the absence "
            "of known assumptions (Art.10(2)(d))"
        )

    gaps = manifest.get("data_gaps", {})
    if "identified_gaps" not in gaps:
        errors.append("MISSING data_gaps.identified_gaps (Art.10(2)(g))")
    if "mitigation_measures" not in gaps:
        errors.append("MISSING data_gaps.mitigation_measures (Art.10(2)(g))")

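    # Validate that each identified gap has a mitigation entry.
    # Mitigation entries without a gap_id are ignored rather than raising KeyError.
    identified = gaps.get("identified_gaps", [])
    mitigations = {
        m["gap_id"] for m in gaps.get("mitigation_measures", []) if m.get("gap_id")
    }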
    for gap in identified:
        if gap.get("gap_id") not in mitigations:
            errors.append(
                f"Gap '{gap.get('gap_id')}' has no mitigation entry (Art.10(2)(g))"
            )

    return errors
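
To make the expected manifest shape concrete, here is an invented fragment together with a small helper that lists gaps lacking a mitigation. The keys `assumptions`, `data_gaps`, `identified_gaps`, `mitigation_measures`, and `gap_id` come from the validator above; every other key and all the values are hypothetical.

```python
# Invented example manifest fragment; only the top-level keys and gap_id
# are taken from the validator, the rest is illustrative.
manifest = {
    "assumptions": [
        {"id": "A1", "statement": "Income field reflects gross annual income in EUR"},
    ],
    "data_gaps": {
        "identified_gaps": [
            {"gap_id": "G1", "description": "Applicants over 70 are underrepresented"},
            {"gap_id": "G2", "description": "No records collected after 2022"},
        ],
        "mitigation_measures": [
            {"gap_id": "G1", "measure": "Collect additional samples for this group"},
        ],
    },
}


def unmitigated_gap_ids(manifest: dict) -> list[str]:
    """Return the IDs of identified gaps that have no mitigation entry."""
    gaps = manifest.get("data_gaps", {})
    mitigated = {m.get("gap_id") for m in gaps.get("mitigation_measures", [])}
    return [
        g.get("gap_id")
        for g in gaps.get("identified_gaps", [])
        if g.get("gap_id") not in mitigated
    ]


print(unmitigated_gap_ids(manifest))  # ['G2']
```

Here gap G2 has no mitigation entry, so the gate would block the dataset until one is added.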

Gate 5 — Dataset Suitability and Statistical Properties Gate (Art.10(2)(e) and Art.10(4))

Art.10(2)(e) requires assessment of the availability, quantity, and suitability of datasets. Art.10(4) requires appropriate statistical properties.

This gate runs automated statistical checks against a documented minimum threshold configuration:

import json

def validate_statistical_properties(
    stats_path: str,
    thresholds_path: str,
) -> list[str]:
    with open(stats_path) as f:
        stats = json.load(f)
    with open(thresholds_path) as f:
        thresholds = json.load(f)

    errors = []

    # Check minimum sample size
    actual_size = stats.get("total_samples", 0)
    min_size = thresholds.get("min_samples_required", 0)
    if actual_size < min_size:
        errors.append(
            f"Dataset size {actual_size} < required minimum {min_size} "
            f"(Art.10(2)(e): insufficient quantity)"
        )

    # Check class balance
    class_dist = stats.get("class_distribution", {})
    min_class_ratio = thresholds.get("min_class_ratio", 0.05)
    total = sum(class_dist.values())
    for cls, count in class_dist.items():
        ratio = count / total if total > 0 else 0
        if ratio < min_class_ratio:
            errors.append(
                f"Class '{cls}' ratio {ratio:.3f} below threshold {min_class_ratio} "
                f"(Art.10(4): statistical properties)"
            )

    # Check missing value rate
    missing_rate = stats.get("missing_value_rate", 0)
    max_missing = thresholds.get("max_missing_value_rate", 0.05)
    if missing_rate > max_missing:
        errors.append(
            f"Missing value rate {missing_rate:.3f} exceeds maximum {max_missing} "
            f"(Art.10(3): free of errors)"
        )

    return errors
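
The gate reads a precomputed stats file, but the post does not show how that file is produced. A minimal sketch follows, assuming the dataset can be loaded as a list of row dictionaries and that a missing value is `None` or an empty string. `compute_dataset_stats` is a hypothetical helper; the output keys match those the validator reads.

```python
from collections import Counter


def compute_dataset_stats(rows: list[dict], label_key: str) -> dict:
    """Summarise a tabular dataset into the fields the statistics gate reads."""
    total_cells = 0
    missing_cells = 0
    labels = Counter()

    for row in rows:
        for value in row.values():
            total_cells += 1
            # Assumption: None and "" both count as missing values
            if value is None or value == "":
                missing_cells += 1
        label = row.get(label_key)
        if label is not None:
            labels[str(label)] += 1

    return {
        "total_samples": len(rows),
        "class_distribution": dict(labels),
        "missing_value_rate": missing_cells / total_cells if total_cells else 0.0,
    }
```

Writing the result with `json.dump` to the dataset's `stats.json` in an earlier pipeline step gives the gate a file generated from the data itself rather than one maintained by hand.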

Full CI/CD Pipeline Integration

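Here is a GitHub Actions workflow that runs all five Art.10 gates sequentially for each dataset. Steps are ordered by cost, with fast checks first and expensive checks last, so the gate numbers in the workflow follow execution order rather than the section order above: the bias gate (Gate 3 above) runs last as Gate 5.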

name: EU AI Act Art.10 Data Governance Gates

on:
  pull_request:
    paths:
      - 'datasets/**'
      - 'training_configs/**'

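jobs:
  # Builds the dataset list for the matrix below. A minimal sketch that assumes
  # every entry under datasets/ is one dataset directory; adapt to your layout.
  discover-datasets:
    runs-on: ubuntu-latest
    outputs:
      datasets: ${{ steps.list.outputs.datasets }}
    steps:
      - uses: actions/checkout@v4
      - id: list
        run: |
          echo "datasets=$(ls datasets | jq -R . | jq -sc .)" >> "$GITHUB_OUTPUT"

  art10-compliance:
    needs: discover-datasets
    runs-on: ubuntu-latest
    strategy:
      matrix:
        dataset: ${{ fromJson(needs.discover-datasets.outputs.datasets) }}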

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'

      - name: Install compliance tools
        run: pip install -r requirements-compliance.txt

      # Gate 1: Provenance (fast — JSON validation only)
      - name: Gate 1 — Art.10(2)(b) Provenance Documentation
        run: |
          python scripts/compliance/validate_provenance.py \
            datasets/${{ matrix.dataset }}/manifest.json
        # Fail fast — no point running bias tests on undocumented data

      # Gate 2: Preparation log (fast — JSON validation only)
      - name: Gate 2 — Art.10(2)(c) Data Preparation Log
        run: |
          python scripts/compliance/validate_preparation_log.py \
            datasets/${{ matrix.dataset }}/operations_log.json

      # Gate 3: Assumptions and gaps (fast — JSON validation only)
      - name: Gate 3 — Art.10(2)(d)/(g) Assumptions and Data Gaps
        run: |
          python scripts/compliance/validate_assumptions_gaps.py \
            datasets/${{ matrix.dataset }}/manifest.json

      # Gate 4: Statistical properties (medium — reads dataset stats)
      - name: Gate 4 — Art.10(2)(e)/(4) Statistical Properties
        run: |
          python scripts/compliance/validate_statistics.py \
            datasets/${{ matrix.dataset }}/stats.json \
            use_cases/${{ matrix.dataset }}/thresholds.json

      # Gate 5: Bias testing (slow — may run ML fairness checks)
      - name: Gate 5 — Art.10(2)(f)/(3) Bias and Representativeness
        run: |
          python scripts/compliance/validate_bias_report.py \
            datasets/${{ matrix.dataset }}/bias_report.json \
            use_cases/${{ matrix.dataset }}/use_case.json

      # Compliance certificate: generate Art.10 documentation artifact
      - name: Generate Art.10 Compliance Certificate
        if: success()
        run: |
          python scripts/compliance/generate_art10_certificate.py \
            datasets/${{ matrix.dataset }}/ \
            --output artifacts/art10-certificates/${{ matrix.dataset }}-$(date +%Y%m%d).json

      - name: Upload compliance artifacts
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: art10-compliance-${{ matrix.dataset }}
          path: artifacts/art10-certificates/
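          retention-days: 90  # capped by the repository's artifact retention limit; archive long-term copies elsewhere

GitHub Actions artifacts cannot serve as the 10-year archive on their own: retention-days is capped by the repository or organisation retention limit, which is far below 3,650 days. Treat the uploaded artifact as a short-lived copy and push the certificate to durable storage, for example an object store with a retention lock. The 10-year figure itself is not arbitrary: under Art.18, the technical documentation described in Annex IV must be kept for 10 years after the high-risk AI system is placed on the market or put into service.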

The Art.10 Compliance Certificate

When all five gates pass, generate a machine-readable certificate that becomes part of your Annex IV technical documentation package:

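import json
from datetime import datetime, timezone
from pathlib import Path


def load_json(path) -> dict:
    with open(path) as f:
        return json.load(f)


def generate_art10_certificate(dataset_dir: str, output_path: str) -> None:
    # compute_sha256 is a project helper (not shown); it should hash the three
    # input documents deterministically. Call this function only after every
    # gate has passed: the gates_passed flags are asserted, not re-checked here.
    base = Path(dataset_dir)
    manifest = load_json(base / "manifest.json")
    ops_log = load_json(base / "operations_log.json")
    bias_report = load_json(base / "bias_report.json")

    certificate = {
        "certificate_type": "EU_AI_ACT_ART10_DATA_GOVERNANCE",
        "regulation_version": "EU 2024/1689",
        # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated in 3.12)
        "generated_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "dataset_id": manifest["dataset_id"],
        "gates_passed": {
            "art10_2b_provenance": True,
            "art10_2c_preparation_log": True,
            "art10_2d_assumptions": True,
            "art10_2e_suitability": True,
            "art10_2f_bias_testing": True,
            "art10_2g_gaps_documented": True,
            "art10_3_representativeness": True,
            "art10_4_statistical_properties": True,
        },
        "certificate_hash": compute_sha256(manifest, ops_log, bias_report),
        "valid_for_training_run": True,
    }

    # Create the output directory if the CI runner does not have it yet
    Path(output_path).parent.mkdir(parents=True, exist_ok=True)
    with open(output_path, "w") as f:
        json.dump(certificate, f, indent=2)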

This certificate should be:

Committed to your compliance repository alongside the model version it authorizes
Referenced in your Annex IV documentation as evidence that Art.10 requirements were met
Archived for 10 years per Annex IV retention requirements
Linked to the specific model version trained on this dataset
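
The certificate generator calls a `compute_sha256` helper that the post does not define. One possible shape for it is sketched below, assuming the inputs are JSON-serialisable dictionaries: it hashes a canonical (key-sorted) serialisation so the result does not depend on key order. `certificate_matches` is a hypothetical companion for re-checking an archived certificate against the archived inputs.

```python
import hashlib
import json


def compute_sha256(*documents: dict) -> str:
    """Hash JSON documents in a way that does not depend on key order."""
    digest = hashlib.sha256()
    for doc in documents:
        canonical = json.dumps(doc, sort_keys=True, separators=(",", ":"))
        digest.update(canonical.encode("utf-8"))
        digest.update(b"\n")  # separator so document boundaries affect the hash
    return digest.hexdigest()


def certificate_matches(certificate: dict, *documents: dict) -> bool:
    """Check that a stored certificate still corresponds to the given inputs."""
    return certificate.get("certificate_hash") == compute_sha256(*documents)
```

Recomputing the hash at audit time shows whether the manifest, operations log, or bias report changed after the certificate was issued.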

Connecting to Your Art.9 Risk Management System

Art.10 compliance does not exist in isolation. Your Art.9 Risk Management System (RMS) defines which risk categories are material for your use case — and those risk categories must flow directly into your Art.10 bias gate configuration.

Specifically:

The bias_dimensions_required in Gate 3 must be derived from your RMS's identified fundamental rights risk areas
If your RMS identifies gender discrimination as a risk for your use case, the bias gate must test for gender-based disparities
If a bias test finds a disparity that exceeds your risk threshold, your RMS must be updated to reflect the new risk finding

This creates a compliance feedback loop: RMS → Art.10 gate config → bias test results → RMS update.
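
The first arrow of that loop can itself be checked automatically. The sketch below assumes the RMS exports a risk register as a list of dictionaries with `category`, `status`, and `protected_attributes` keys. That structure is hypothetical, so adapt it to whatever your RMS actually produces.

```python
def derive_bias_dimensions(risk_register: list[dict]) -> list[str]:
    """Collect the protected attributes named by active fundamental-rights risks."""
    dimensions = set()
    for risk in risk_register:
        if risk.get("category") != "fundamental_rights":
            continue
        if risk.get("status") == "retired":
            continue
        dimensions.update(risk.get("protected_attributes", []))
    return sorted(dimensions)


def config_drift(risk_register: list[dict], use_case_config: dict) -> set[str]:
    """Return dimensions the RMS requires that the gate config does not test."""
    required = set(derive_bias_dimensions(risk_register))
    configured = set(use_case_config.get("bias_dimensions_required", []))
    return required - configured
```

A non-empty result from `config_drift` means the gate configuration has fallen behind the risk register and should fail the pipeline.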

What Happens When a Gate Fails

A failing gate is not just a CI check failure — it is a documented compliance finding that must be addressed before the dataset is used for training. When a gate fails:

Block the training run — do not proceed to model training with non-compliant data
Create a compliance finding record — log the failure with timestamp, gate ID, dataset ID, and specific violation
Assign ownership — route the finding to the data owner for resolution
Require sign-off before re-run — after the underlying issue is fixed, a designated compliance reviewer must approve re-running the gate

This mirrors how software security gates work: a critical vulnerability finding blocks deployment until it is resolved and reviewed.
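
A compliance finding record with those fields could be modelled as follows. This is a sketch: the class name, field names, and the rule that sign-off is a non-empty reviewer identifier are all assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ComplianceFinding:
    gate_id: str
    dataset_id: str
    violation: str
    owner: Optional[str] = None             # data owner the finding is routed to
    reviewer_signoff: Optional[str] = None  # set by the compliance reviewer
    detected_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def may_rerun(self) -> bool:
        """A gate re-run is allowed only after a reviewer has signed off."""
        return bool(self.reviewer_signoff)

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


def findings_from_errors(
    gate_id: str, dataset_id: str, errors: list[str]
) -> list[ComplianceFinding]:
    """Turn a gate's error list into one finding record per violation."""
    return [ComplianceFinding(gate_id, dataset_id, e) for e in errors]
```

Each gate script can pass its error list through `findings_from_errors` and persist the JSON records before exiting with a non-zero status.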

Handling Exceptions Without Bypassing Art.10

Real-world data pipelines sometimes need to use datasets that cannot fully satisfy every Art.10 requirement — for example, historical datasets collected before Art.10 documentation requirements existed.

The EU AI Act does not provide a blanket exception for legacy data, but Art.10 requires that gaps be identified and addressed (Art.10(2)(g)). For legacy datasets, "addressed" can mean:

Documenting the gap explicitly in the data gaps section
Conducting supplementary bias testing to compensate for missing provenance records
Defining a time-bounded remediation plan for backfilling documentation
Getting sign-off from your compliance officer before using the dataset

Your CI/CD gates should support an exception mode that allows a dataset to proceed with documented exceptions, but requires:

Human approval (not automated bypass)
Exception reason and expiry date
Compensating control description

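import json
from datetime import datetime, timezone


def check_exception_validity(manifest_path: str) -> bool:
    with open(manifest_path) as f:
        manifest = json.load(f)
    exception = manifest.get("compliance_exception")

    if not exception:
        return False

    # Exception requires explicit human approval
    if not exception.get("approved_by"):
        return False

    # Exception must carry an expiry date and must not be expired
    expires_at = exception.get("expires_at")
    if not expires_at:
        return False
    expiry = datetime.fromisoformat(expires_at)
    if expiry.tzinfo is None:
        # Treat timestamps without a UTC offset as UTC
        expiry = expiry.replace(tzinfo=timezone.utc)
    if expiry < datetime.now(timezone.utc):
        return False

    return True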
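
The function above returns only a yes or no answer and does not look at the exception reason or the compensating control. A variant that covers all three listed requirements and reports why an exception is rejected might look like this. The field names `reason` and `compensating_control` are assumptions, since the post does not define them, and `exception_problems` is a hypothetical helper.

```python
from datetime import datetime, timezone
from typing import Optional

# approved_by and expires_at appear in the post; the other two names are assumed.
REQUIRED_EXCEPTION_FIELDS = (
    "approved_by",
    "reason",
    "compensating_control",
    "expires_at",
)


def exception_problems(
    exception: Optional[dict], now: Optional[datetime] = None
) -> list[str]:
    """Return the reasons an exception is invalid; an empty list means valid."""
    if not exception:
        return ["no compliance_exception recorded"]

    problems = [
        f"missing {name}" for name in REQUIRED_EXCEPTION_FIELDS if not exception.get(name)
    ]

    expires_at = exception.get("expires_at")
    if expires_at:
        try:
            expiry = datetime.fromisoformat(expires_at)
        except (TypeError, ValueError):
            problems.append("expires_at is not an ISO 8601 timestamp")
        else:
            if expiry.tzinfo is None:
                expiry = expiry.replace(tzinfo=timezone.utc)  # assume UTC
            if expiry < (now or datetime.now(timezone.utc)):
                problems.append("exception has expired")
    return problems
```

Returning the list of problems lets the pipeline log exactly why an exception was refused, which is useful evidence when the finding is reviewed later.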

Checklist: Art.10 CI/CD Gates Before August 2026

For each training dataset used by a high-risk AI system (gate numbers follow the workflow's execution order, not the section order above):

Gate 1: Provenance manifest present — source, collection dates, license
Gate 1: For personal data — original collection purpose documented
Gate 2: Operations log present with annotation, cleaning, preparation steps
Gate 3: Assumptions list documented (with an explicit entry if no assumptions are known)
Gate 3: Data gaps identified and mitigations documented
Gate 4: Statistical properties validated against use-case thresholds
Gate 5: Bias report present for all required demographic dimensions
Gate 5: Representativeness assessment present
Art.10 compliance certificate generated and archived
Certificate linked to specific model version in Annex IV documentation
Compliance artifacts set to 10-year retention
RMS bias dimensions aligned with Art.10 gate configuration

What Comes Next

This is Post #4 in the Data Governance Sprint. The final post covers the complete Art.10 compliance checklist — a consolidated reference that combines provenance, bias testing, preparation documentation, and CI/CD gates into a single go/no-go framework you can use before the August 2026 deadline.

If you are building high-risk AI systems and want to discuss how to structure Art.10-compliant data pipelines on EU infrastructure, sota.io provides EU-native managed hosting with no CLOUD Act exposure — your compliance artifacts stay under EU jurisdiction from the start.