2026-06-03·5 min read·sota.io Team

EU AI Act Art.10 Data Provenance Logging: Tracking Training Data Origin, Modifications, and Governance Records

Post #3 in the sota.io EU AI Act Data Governance Sprint Series


When an EU AI Act market surveillance authority audits your high-risk AI system, one of the first questions is: where did your training data come from, and can you prove how it was modified before training?

Article 10 of Regulation (EU) 2024/1689 does not treat training data as a black box. It requires providers to document the entire lifecycle of training, validation, and testing data — from original source through every transformation to final dataset composition. This is data provenance, and the EU AI Act makes it a legal obligation for high-risk AI systems.

This post covers what Art.10 requires for data provenance and origin documentation, what records must be maintained, and how to build a provenance-logging pipeline that satisfies both the regulation and real-world audit requirements.


What Art.10 Actually Requires for Data Provenance

Article 10(2) of Regulation (EU) 2024/1689 specifies the data governance practices that must cover training, validation, and testing data. Two sub-provisions directly govern provenance:

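Art.10(2)(b) — Data collection and origin:

"data collection processes and the origin of data, and in the case of personal data, the original purpose of the data collection"

This requires documentation of how data was collected and where it came from. For each training dataset component, providers should record at least the source, the collection period, and, for personal data, the purpose for which it was originally collected.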

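Art.10(2)(c) — Data preparation operations:

"relevant data-preparation processing operations, such as annotation, labelling, cleaning, updating, enrichment and aggregation"

Every transformation applied to raw data before it enters the training pipeline must be documented, including steps such as the deduplication, filtering, redaction, and labelling operations illustrated in the Preparation Operations Log later in this post.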

Together, Art.10(2)(b) and Art.10(2)(c) create a requirement for end-to-end data lineage — a traceable record from raw source to finalized training split.


Why "Data Provenance" Is More Than a Spreadsheet

Many teams treat data documentation as a one-page description filed with the technical documentation. This misunderstands what Art.10 requires.

The regulation specifically links provenance to compliance outcomes. Article 10(2)(f) requires examination of data for possible biases, and that examination requires knowing what the data contains and where it came from. Article 10(3)'s requirement that data be, to the best extent possible, "free of errors" cannot be demonstrated without records of what cleaning steps were applied. Article 10(4)'s requirement to account for the geographical, contextual, behavioural or functional setting in which the system will be used requires knowing the geographic origin of training samples.

Provenance records serve three concrete compliance functions:

  1. Audit response: When a national market surveillance authority requests evidence of data governance, provenance logs are the primary artifact. A retrospective "we believe we used dataset X" is not sufficient.

  2. Bias root-cause analysis: If a post-deployment bias incident occurs, provenance records allow you to trace whether the problem originated in the collection process, the labelling instructions, or the aggregation step.

  3. Dataset update governance: When training data is refreshed, provenance records for the previous version establish the baseline for change impact assessment.


What Records Must Be Maintained

Article 10 does not prescribe a specific format, but the records must support the technical documentation requirements of Annex IV. Based on the obligation structure in Art.10(2), compliant provenance records cover the following:

Dataset Registry

A master registry listing every dataset component used in training, validation, and testing, with:

| Field | Description |
|---|---|
| Dataset ID | Unique identifier for internal reference |
| Name / version | Human-readable label and version string |
| Source type | Public / licensed / proprietary / synthetic / third-party |
| Source URL or supplier | Reference to the original location or provider contract |
| Collection date range | First and last date of data collection |
| Record count | Number of records included from this source |
| License / legal basis | GDPR legal basis if personal data; licensing terms otherwise |
| Last reviewed | Date of most recent provenance review |
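A registry is only useful in an audit if every entry is complete. The following is a minimal sketch of a completeness check, assuming registry entries are stored as dictionaries; the field names, the `missing_fields` helper, and the sample entry are hypothetical and simply mirror the registry fields listed in this section.

```python
# Hypothetical field names mirroring the Dataset Registry fields in this section.
REQUIRED_FIELDS = [
    "dataset_id", "name", "version", "source_type", "source",
    "collection_date_start", "collection_date_end",
    "record_count", "license_or_legal_basis", "last_reviewed",
]

def missing_fields(entry: dict) -> list[str]:
    """Return the required registry fields that are absent or empty in an entry."""
    return [f for f in REQUIRED_FIELDS if entry.get(f) in (None, "", [])]

# Illustrative entry (all values are made up); "last_reviewed" is left out on purpose.
entry = {
    "dataset_id": "ds-001",
    "name": "public_corpus",
    "version": "v3",
    "source_type": "public",
    "source": "https://example.org/corpus",
    "collection_date_start": "2025-01-01",
    "collection_date_end": "2025-06-30",
    "record_count": 2_450_000,
    "license_or_legal_basis": "CC-BY-4.0",
}

print(missing_fields(entry))  # ['last_reviewed']
```

A check like this could run whenever a registry entry is created or updated, so that gaps surface long before an authority asks for the record.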

Preparation Operations Log

For each dataset component, a chronological log of every preparation step:

| Step | Operation | Operator | Date | Input count | Output count | Parameters |
|---|---|---|---|---|---|---|
| 1 | Deduplication | ETL pipeline v2.3 | 2025-11-01 | 2,450,000 | 2,310,000 | Jaccard similarity > 0.95 |
| 2 | Language filtering | langdetect v1.0.9 | 2025-11-02 | 2,310,000 | 2,198,000 | EN/DE/FR only |
| 3 | PII redaction | presidio v2.2.0 | 2025-11-03 | 2,198,000 | 2,198,000 | PERSON, EMAIL, PHONE |
| 4 | Labelling (Round 1) | Contractor team A | 2025-11-05 | 50,000 sample | 50,000 | Guidelines v1.4 |
| 5 | Label QA | Internal review | 2025-11-12 | 50,000 | 49,200 | IAA threshold 0.82 |
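One property an auditor can check quickly in a log like this is count continuity: each step's input count should match the previous step's output count unless a documented sampling or merge step explains the difference. Below is a minimal sketch of that check using the counts from the example log; the `find_count_gaps` helper and the dictionary layout are assumptions made for illustration.

```python
def find_count_gaps(steps: list[dict]) -> list[tuple[int, int, int]]:
    """Return (step, expected_input, actual_input) for every step whose input
    count differs from the previous step's output count."""
    gaps = []
    for prev, cur in zip(steps, steps[1:]):
        if cur["input_count"] != prev["output_count"]:
            gaps.append((cur["step"], prev["output_count"], cur["input_count"]))
    return gaps

# Counts taken from the example preparation log in this section.
steps = [
    {"step": 1, "operation": "Deduplication", "input_count": 2_450_000, "output_count": 2_310_000},
    {"step": 2, "operation": "Language filtering", "input_count": 2_310_000, "output_count": 2_198_000},
    {"step": 3, "operation": "PII redaction", "input_count": 2_198_000, "output_count": 2_198_000},
    {"step": 4, "operation": "Labelling (Round 1)", "input_count": 50_000, "output_count": 50_000},
    {"step": 5, "operation": "Label QA", "input_count": 50_000, "output_count": 49_200},
]

print(find_count_gaps(steps))  # [(4, 2198000, 50000)]
```

The single gap at step 4 is the labelling sample drawn from the full corpus. A gap like this is not an error, but it should be backed by a sampling record that explains how the 50,000 records were selected.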

Annotation Documentation

For supervised learning tasks, annotation records must include:

Aggregation and Sampling Records

When multiple dataset components are combined:


Building a Provenance-Logging Pipeline

Most high-risk AI teams will need to integrate provenance logging into an existing data engineering workflow. The key principle is that provenance should be generated automatically as data flows through the pipeline, not assembled manually after training completes.

Stage 1: Source Registration

At the point of data ingestion, every new dataset source triggers the creation of a provenance record. This can be implemented as a registration step in your ETL or data warehouse layer:

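import json
from dataclasses import dataclass, asdict
from pathlib import Path

@dataclass
class DatasetSource:
    dataset_id: str
    name: str
    version: str
    source_type: str          # "public" | "licensed" | "proprietary" | "synthetic"
    source_url: str
    collection_date_start: str  # ISO 8601 date, e.g. "2025-01-01"
    collection_date_end: str
    license: str
    legal_basis: str          # GDPR Art.6 basis if personal data
    record_count: int

def register_source(source: DatasetSource, registry_path: str) -> Path:
    """Write one JSON provenance record per dataset source and return its path."""
    registry = Path(registry_path)
    registry.mkdir(parents=True, exist_ok=True)  # create the registry directory on first use
    record_path = registry / f"{source.dataset_id}.json"
    with record_path.open("w", encoding="utf-8") as f:
        json.dump(asdict(source), f, indent=2)
    return record_path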

Stage 2: Operation Logging

Each preparation step emits a provenance event. A lightweight event log stores the operation type, parameters, and record counts before and after:

import json
import time

def log_operation(dataset_id: str, operation: str, operator: str,
                  params: dict, input_count: int, output_count: int,
                  log_path: str):
    """Append one provenance event to a JSON Lines log (one JSON object per line)."""
    event = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),  # UTC
        "dataset_id": dataset_id,
        "operation": operation,
        "operator": operator,
        "params": params,
        "input_count": input_count,
        "output_count": output_count,
        "records_removed": input_count - output_count,
    }
    # Append-only: earlier events are never rewritten.
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")

Stage 3: Dataset Snapshot Hashing

When a dataset version is finalized for training, generate a cryptographic hash of the dataset contents. This serves two purposes: it proves the training dataset was not modified after documentation was created, and it enables exact reproduction of the training run.

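import hashlib
import time
from pathlib import Path

def snapshot_dataset(dataset_path: str, metadata: dict) -> dict:
    """Hash every file under dataset_path, in sorted path order, into one SHA-256 digest."""
    h = hashlib.sha256()
    for filepath in sorted(Path(dataset_path).rglob("*")):
        if filepath.is_file():
            # Read in 1 MiB chunks so large files are not loaded into memory at once.
            with filepath.open("rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
    snapshot = {
        "sha256": h.hexdigest(),
        "dataset_path": dataset_path,
        "snapshot_ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "metadata": metadata,
    }
    return snapshot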

Store these snapshots in your version-controlled documentation repository alongside the technical documentation.
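A stored snapshot is only evidence if it can be re-checked later. The sketch below shows one way to verify a dataset against a stored snapshot, assuming the snapshot was produced by hashing file contents in sorted path order as described in this stage; the `dataset_digest` and `verify_snapshot` names are hypothetical.

```python
import hashlib
from pathlib import Path

def dataset_digest(dataset_path: str) -> str:
    """Recompute the SHA-256 digest over all file contents in sorted path order."""
    h = hashlib.sha256()
    for filepath in sorted(Path(dataset_path).rglob("*")):
        if filepath.is_file():
            with filepath.open("rb") as f:
                # 1 MiB chunks keep memory use flat for large files.
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
    return h.hexdigest()

def verify_snapshot(dataset_path: str, snapshot: dict) -> bool:
    """Return True if the dataset on disk still matches the stored snapshot hash."""
    return dataset_digest(dataset_path) == snapshot["sha256"]
```

Verification only works if the digest is computed with exactly the same scheme that produced the snapshot. Note that this scheme hashes file contents only, so a pure file rename that preserves sort order would go undetected; including relative paths in the hash is one possible hardening step.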

Stage 4: Lineage Graph

For complex pipelines with multiple input datasets, a lineage graph makes the relationships queryable and auditable. Tools like DVC (open source, Apache 2.0) or Apache Atlas provide lineage tracking with API access:

# dvc.yaml — lineage declared as pipeline stages
stages:
  ingest_public_corpus:
    cmd: python scripts/ingest.py --source public_corpus_v3
    deps:
      - scripts/ingest.py
    outs:
      - data/raw/public_corpus_v3.parquet

  clean_public_corpus:
    cmd: python scripts/clean.py --input data/raw/public_corpus_v3.parquet
    deps:
      - scripts/clean.py
      - data/raw/public_corpus_v3.parquet
    outs:
      - data/processed/public_corpus_v3_clean.parquet

  assemble_training_set:
    cmd: python scripts/assemble.py --config config/dataset_mix_v2.yaml
    deps:
      - scripts/assemble.py
      - data/processed/
    outs:
      - data/final/training_v7.parquet

DVC derives the full dependency graph from these stage definitions and records checksums of each stage's dependencies and outputs in dvc.lock, giving you an auditable lineage record without additional infrastructure.
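To illustrate what "queryable" lineage means in practice, here is a minimal sketch that walks a stage graph backwards from a final artifact to every stage that contributed to it. The `STAGES` dictionary is a hand-written, hypothetical mirror of the pipeline stages shown above, and `produces` and `upstream_stages` are illustrative helpers, not part of DVC.

```python
# Hypothetical in-memory mirror of the pipeline stages shown above.
STAGES = {
    "ingest_public_corpus": {
        "deps": ["scripts/ingest.py"],
        "outs": ["data/raw/public_corpus_v3.parquet"],
    },
    "clean_public_corpus": {
        "deps": ["scripts/clean.py", "data/raw/public_corpus_v3.parquet"],
        "outs": ["data/processed/public_corpus_v3_clean.parquet"],
    },
    "assemble_training_set": {
        "deps": ["scripts/assemble.py", "data/processed/"],
        "outs": ["data/final/training_v7.parquet"],
    },
}

def produces(out: str, dep: str) -> bool:
    """True if output `out` satisfies dependency `dep` (same path, or inside a directory dep)."""
    return out == dep or (dep.endswith("/") and out.startswith(dep))

def upstream_stages(target: str, stages: dict) -> list[str]:
    """Return the stages that contribute to `target`, ordered from source to target.
    Assumes the stage graph is acyclic."""
    order: list[str] = []

    def visit(path: str) -> None:
        for name, stage in stages.items():
            if any(produces(out, path) for out in stage["outs"]):
                for dep in stage["deps"]:
                    visit(dep)  # resolve upstream stages first
                if name not in order:
                    order.append(name)

    visit(target)
    return order

print(upstream_stages("data/final/training_v7.parquet", STAGES))
# ['ingest_public_corpus', 'clean_public_corpus', 'assemble_training_set']
```

In a real DVC project the graph is already available through the `dvc dag` command; the sketch only shows the kind of question ("which steps touched this training set?") that a lineage graph should be able to answer.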


Connecting Provenance to Other Art.10 Obligations

Data provenance records are not standalone — they feed directly into the other Art.10 compliance obligations covered in this series:

Art.10(3) — Errors and completeness: Provenance logs demonstrate what cleaning and error-correction steps were applied. Without this record, the claim that data is "free of errors" cannot be substantiated.

Art.10(2)(f) — Bias examination: You cannot examine training data for possible biases without knowing its origin and demographic composition. Provenance records are the prerequisite for the bias auditing process covered in Post #2 of this series.

Art.12 — Record-keeping and logging: Art.12 requires automatic logging of high-risk AI system events. Data provenance records complement the runtime logs by covering the pre-training phase that Art.12 does not reach.

Annex IV — Technical documentation: Point 2(d) of Annex IV requires the technical documentation to describe the training data sets used, including their provenance, scope and main characteristics, how the data was obtained and selected, and the labelling and cleaning procedures applied. Provenance records are the source material for this section.


Third-Party Training Data

If your training data comes from a third-party supplier — a licensed corpus, a data broker, a pre-trained model — Art.10 still applies. You are responsible for:

Contractual data governance clauses with training data suppliers are advisable: require that suppliers maintain Art.10-compatible provenance records and provide access to them in the event of an audit.

For pre-trained models used as a base for fine-tuning, document the model card's stated training data provenance and note any gaps between what is documented and what Art.10 would require if you had collected that data yourself.


Records Retention

Article 10 does not specify a retention period for provenance records. However, Art.18 requires providers to keep the Annex IV technical documentation at the disposal of national competent authorities for 10 years after the high-risk AI system has been placed on the market or put into service. Provenance records are part of this technical documentation and should follow the same retention schedule.

Practical implication: if your training dataset was assembled in 2025 for a system placed on the market in 2026, those provenance records must be retained until at least 2036.
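The retention arithmetic can be sketched in a few lines. This is a minimal illustration assuming a 10-year period counted from the date the system is placed on the market; the `retention_end` helper and the example dates are hypothetical.

```python
from datetime import date

RETENTION_YEARS = 10  # retention period for technical documentation

def retention_end(placed_on_market: date, years: int = RETENTION_YEARS) -> date:
    """Return the earliest date on which the retention period has elapsed."""
    try:
        return placed_on_market.replace(year=placed_on_market.year + years)
    except ValueError:
        # 29 February with a non-leap target year: fall back to 28 February.
        return placed_on_market.replace(year=placed_on_market.year + years, day=28)

# Hypothetical placing-on-the-market date.
print(retention_end(date(2026, 8, 2)))  # 2036-08-02
```

If the same dataset version backs several systems or releases, the latest placing-on-the-market date determines how long its provenance records need to be kept.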


Checklist: Art.10 Data Provenance Compliance


What Comes Next in This Series

Post #4 covers how to build automated data governance CI/CD gates — checks that run on every training data refresh to verify that provenance records are complete and the dataset still meets Art.10(3) requirements before a new training run can begin.

Post #5 will assemble the complete Art.10 compliance checklist and training data documentation package for Annex IV submission.

The August 2, 2026 deadline applies to all providers who have placed or are placing high-risk AI systems on the EU market. Data governance documentation must be complete at the time of placing the system on the market — not at the time of first audit.


This post is part of the sota.io EU AI Act Data Governance Sprint series covering Article 10 compliance for high-risk AI providers. Read Post #1: Training Data Documentation Requirements | Read Post #2: Dataset Diversity and Bias Testing
