2026-06-03·5 min read·sota.io Team

EU AI Act Art.10 Data Provenance Logging: Tracking Training Data Origin, Modifications, and Governance Records

Post #3 in the sota.io EU AI Act Data Governance Sprint Series


When an EU AI Act market surveillance authority audits your high-risk AI system, one of the first questions is: where did your training data come from, and can you prove how it was modified before training?

Article 10 of Regulation (EU) 2024/1689 does not treat training data as a black box. It requires providers to document the entire lifecycle of training, validation, and testing data — from original source through every transformation to final dataset composition. This is data provenance, and the EU AI Act makes it a legal obligation for high-risk AI systems.

This post covers what Art.10 requires for data provenance and origin documentation, what records must be maintained, and how to build a provenance-logging pipeline that satisfies both the regulation and real-world audit requirements.

What Art.10 Actually Requires for Data Provenance

Article 10(2) of Regulation (EU) 2024/1689 specifies the data governance practices that must cover training, validation, and testing data. Two sub-provisions directly govern provenance:

Art.10(2)(b), data collection and origin:

"data collection processes and the origin of data, and in the case of personal data, the original purpose of the data collection"

This requires documentation of how data was collected and where it came from. For each training dataset component, providers must record:

The original source (public dataset, licensed corpus, proprietary collection, synthetic generation, third-party supplier)
The collection methodology (web scraping, survey, sensor recording, labelling contract, API pull)
The date range of collection
Any licensing, consent, or legal basis under which the data was obtained
Whether the source has changed or been updated since initial collection

Art.10(2)(c), data preparation operations:

"relevant data-preparation processing operations, such as annotation, labelling, cleaning, updating, enrichment and aggregation"

Every transformation applied to raw data before it enters the training pipeline must be documented. This includes:

Annotation and labelling: who labelled the data, what annotation guidelines were used, inter-annotator agreement metrics
Cleaning: which records were removed and why (duplicates, out-of-distribution, corrupted, offensive content)
Enrichment: what features were added from secondary sources
Aggregation: how multiple datasets were combined, what weighting or sampling was applied
Filtering: what criteria excluded certain records from the final training split

Together, Art.10(2)(b) and Art.10(2)(c) create a requirement for end-to-end data lineage — a traceable record from raw source to finalized training split.

Why "Data Provenance" Is More Than a Spreadsheet

Many teams treat data documentation as a one-page description filed with the technical documentation. This misunderstands what Art.10 requires.

The regulation ties provenance to compliance outcomes. Article 10(2)(f) requires examination of data for possible biases, and that examination depends on knowing what the data contains and where it came from. Article 10(3) requires data sets to be, to the best extent possible, free of errors and complete, which cannot be demonstrated without records of the cleaning steps applied. Article 10(4) requires data sets to reflect the specific geographical, contextual, behavioural or functional setting in which the system will be used, which in turn requires knowing the geographic origin of training samples.

Provenance records serve three concrete compliance functions:

Audit response: When a national market surveillance authority requests evidence of data governance, provenance logs are the primary artifact. A retrospective "we believe we used dataset X" is not sufficient.
Bias root-cause analysis: If a post-deployment bias incident occurs, provenance records allow you to trace whether the problem originated in the collection process, the labelling instructions, or the aggregation step.
Dataset update governance: When training data is refreshed, provenance records for the previous version establish the baseline for change impact assessment.

What Records Must Be Maintained

Article 10 does not prescribe a specific format, but the records must support the technical documentation requirements of Annex IV. Based on the obligation structure in Art.10(2), compliant provenance records cover the following:

Dataset Registry

A master registry listing every dataset component used in training, validation, and testing, with:

Field	Description
Dataset ID	Unique identifier for internal reference
Name / version	Human-readable label and version string
Source type	Public / licensed / proprietary / synthetic / third-party
Source URL or supplier	Reference to the original location or provider contract
Collection date range	First and last date of data collection
Record count	Number of records included from this source
License / legal basis	GDPR legal basis if personal data; licensing terms otherwise
Last reviewed	Date of most recent provenance review
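A registry like the one above is easier to keep complete if every entry is checked mechanically. The following is a minimal sketch, not a prescribed format: the field names, the allowed source types, and the `validate_registry_entry` helper are illustrative assumptions that mirror the table, and the example entry is invented.

```python
from typing import Any

# Field names are illustrative assumptions that mirror the registry table.
REQUIRED_FIELDS = (
    "dataset_id", "name", "version", "source_type", "source",
    "collection_date_start", "collection_date_end",
    "record_count", "legal_basis", "last_reviewed",
)
ALLOWED_SOURCE_TYPES = {"public", "licensed", "proprietary", "synthetic", "third-party"}

def validate_registry_entry(entry: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the entry passed."""
    problems = []
    for field in REQUIRED_FIELDS:
        if entry.get(field) in (None, ""):
            problems.append(f"missing field: {field}")
    if entry.get("source_type") not in ALLOWED_SOURCE_TYPES:
        problems.append(f"unknown source_type: {entry.get('source_type')!r}")
    start = entry.get("collection_date_start")
    end = entry.get("collection_date_end")
    # ISO 8601 dates (YYYY-MM-DD) compare correctly as plain strings.
    if start and end and start > end:
        problems.append("collection_date_start is after collection_date_end")
    count = entry.get("record_count")
    if not isinstance(count, int) or count < 0:
        problems.append("record_count must be a non-negative integer")
    return problems

# Invented example entry, for illustration only.
example_entry = {
    "dataset_id": "ds-0001",
    "name": "public_corpus",
    "version": "v3",
    "source_type": "public",
    "source": "https://example.org/corpus",
    "collection_date_start": "2025-01-01",
    "collection_date_end": "2025-06-30",
    "record_count": 2_450_000,
    "legal_basis": "n/a (no personal data)",
    "last_reviewed": "2025-11-01",
}
print(validate_registry_entry(example_entry))
```

Returning a list of problems rather than raising on the first one lets a review surface every gap in an entry at once.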

Preparation Operations Log

For each dataset component, a chronological log of every preparation step:

Step	Operation	Operator	Date	Input count	Output count	Parameters
1	Deduplication	ETL pipeline v2.3	2025-11-01	2,450,000	2,310,000	Jaccard similarity > 0.95
2	Language filtering	langdetect v1.0.9	2025-11-02	2,310,000	2,198,000	EN/DE/FR only
3	PII redaction	presidio v2.2.0	2025-11-03	2,198,000	2,198,000	PERSON, EMAIL, PHONE
4	Labelling (Round 1)	Contractor team A	2025-11-05	50,000 sample	50,000	Guidelines v1.4
5	Label QA	Internal review	2025-11-12	50,000	49,200	IAA threshold 0.82
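One simple consistency check on a log like this is that each step's input count matches the previous step's output count. The sketch below assumes the log has been loaded into a list of dictionaries; `find_count_gaps` is a hypothetical helper, and the figures are copied from the table above.

```python
def find_count_gaps(steps: list[dict]) -> list[str]:
    """Flag steps whose input count differs from the previous step's output count."""
    gaps = []
    for prev, cur in zip(steps, steps[1:]):
        if cur["input_count"] != prev["output_count"]:
            gaps.append(
                f"step {cur['step']} ({cur['operation']}): input {cur['input_count']} "
                f"!= output {prev['output_count']} of step {prev['step']}"
            )
    return gaps

# Counts copied from the example log above.
steps = [
    {"step": 1, "operation": "Deduplication", "input_count": 2_450_000, "output_count": 2_310_000},
    {"step": 2, "operation": "Language filtering", "input_count": 2_310_000, "output_count": 2_198_000},
    {"step": 3, "operation": "PII redaction", "input_count": 2_198_000, "output_count": 2_198_000},
    {"step": 4, "operation": "Labelling (Round 1)", "input_count": 50_000, "output_count": 50_000},
    {"step": 5, "operation": "Label QA", "input_count": 50_000, "output_count": 49_200},
]
gaps = find_count_gaps(steps)
for gap in gaps:
    print(gap)
```

A flagged gap is not necessarily an error. Here the only gap is at step 4, where a 50,000-record sample is drawn for labelling; the check simply highlights that the sampling itself deserves its own logged step with its method and seed.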

Annotation Documentation

For supervised learning tasks, annotation records must include:

Annotation guidelines version and effective date
Annotator identities or anonymized contractor group identifiers
Inter-annotator agreement (IAA) scores per label category
Adjudication process for disagreements
Samples excluded due to low-confidence annotations
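The post does not name a specific agreement metric. As one common choice for two annotators, Cohen's kappa can be computed overall and per label category (one label versus the rest). This is a minimal sketch under that assumption; `cohens_kappa` and `per_label_kappa` are illustrative helper names.

```python
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators who labelled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label independently.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)

def per_label_kappa(labels_a: list[str], labels_b: list[str]) -> dict[str, float]:
    """Kappa per category, computed by reducing each label to 'this label' vs 'any other'."""
    categories = sorted(set(labels_a) | set(labels_b))
    return {
        cat: cohens_kappa([str(a == cat) for a in labels_a],
                          [str(b == cat) for b in labels_b])
        for cat in categories
    }

# Tiny invented example: the annotators disagree on the last item only.
annotator_a = ["x", "x", "y", "y"]
annotator_b = ["x", "x", "y", "x"]
kappa = cohens_kappa(annotator_a, annotator_b)
by_label = per_label_kappa(annotator_a, annotator_b)
```

For more than two annotators a different statistic (for example Fleiss' kappa or Krippendorff's alpha) would be needed; whichever metric is used, the annotation record should state it alongside the scores.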

Aggregation and Sampling Records

When multiple dataset components are combined:

The mixing ratio applied (e.g., 60% corpus A, 30% corpus B, 10% synthetic)
Rationale for the chosen ratio
Any upsampling or downsampling applied to balance demographic groups
Train/validation/test split methodology and seed values
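Recording the seed is only useful if the split is deterministic given that seed. The sketch below shows one way to achieve that with the standard library; `split_dataset`, the 80/10/10 fractions, and the record IDs are assumptions for illustration.

```python
import random

def split_dataset(record_ids: list[str], seed: int,
                  fractions: tuple[float, float, float] = (0.8, 0.1, 0.1)) -> dict:
    """Deterministically split record IDs into train/validation/test partitions."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    # Sort first so the result depends only on the seed, not on input order.
    ids = sorted(record_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * fractions[0])
    n_val = int(len(ids) * fractions[1])
    return {
        "seed": seed,            # stored with the split so it can be re-derived later
        "fractions": fractions,
        "train": ids[:n_train],
        "validation": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],
    }

# Invented record IDs for illustration.
record_ids = [f"rec-{i:03d}" for i in range(100)]
split = split_dataset(record_ids, seed=42)
print(len(split["train"]), len(split["validation"]), len(split["test"]))
```

Because the shuffle comes from the language runtime, it is prudent to record the Python version next to the seed, or to store the resulting ID lists themselves, since shuffle output is not guaranteed to be identical across interpreter versions.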

Building a Provenance-Logging Pipeline

Most high-risk AI teams will need to integrate provenance logging into an existing data engineering workflow. The key principle is that provenance should be generated automatically as data flows through the pipeline, not assembled manually after training completes.

Stage 1: Source Registration

At the point of data ingestion, every new dataset source triggers the creation of a provenance record. This can be implemented as a registration step in your ETL or data warehouse layer:

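import json
from dataclasses import dataclass, asdict
from pathlib import Path

@dataclass
class DatasetSource:
    dataset_id: str
    name: str
    version: str
    source_type: str          # "public" | "licensed" | "proprietary" | "synthetic" | "third-party"
    source_url: str
    collection_date_start: str   # ISO 8601 date, e.g. "2025-01-01"
    collection_date_end: str
    license: str
    legal_basis: str          # GDPR Art.6 basis if personal data
    record_count: int

def register_source(source: DatasetSource, registry_path: str) -> Path:
    """Write one JSON provenance record per dataset source and return its path."""
    registry = Path(registry_path)
    registry.mkdir(parents=True, exist_ok=True)   # create the registry directory on first use
    record = registry / f"{source.dataset_id}.json"
    record.write_text(json.dumps(asdict(source), indent=2), encoding="utf-8")
    return record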

Stage 2: Operation Logging

Each preparation step emits a provenance event. A lightweight event log stores the operation type, parameters, and record counts before and after:

import json
import time

def log_operation(dataset_id: str, operation: str, operator: str,
                  params: dict, input_count: int, output_count: int,
                  log_path: str) -> None:
    """Append one provenance event as a JSON line (JSONL) to log_path."""
    event = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "dataset_id": dataset_id,
        "operation": operation,
        "operator": operator,
        "params": params,
        "input_count": input_count,
        "output_count": output_count,
        "records_removed": input_count - output_count,
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(event) + "\n")
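
An append-only JSON Lines log of this kind is straightforward to query at audit time. As a rough sketch, assuming the event fields shown above, a hypothetical `summarise_log` helper could report how many steps ran and how many records were removed per dataset:

```python
import json
from collections import defaultdict

def summarise_log(log_path: str) -> dict[str, dict]:
    """Aggregate step count and removed-record total per dataset_id from a JSONL log."""
    summary = defaultdict(lambda: {"steps": 0, "records_removed": 0})
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            event = json.loads(line)
            entry = summary[event["dataset_id"]]
            entry["steps"] += 1
            # Recompute from the counts rather than trusting a stored difference.
            entry["records_removed"] += event["input_count"] - event["output_count"]
    return dict(summary)
```

Recomputing the removed-record figure from the input and output counts, instead of reading a stored field, means the summary still works for events written by other tools that only log the two counts.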

Stage 3: Dataset Snapshot Hashing

When a dataset version is finalized for training, generate a cryptographic hash of the dataset contents. This serves two purposes: it provides evidence that the training dataset was not modified after the documentation was created, and it lets you confirm that a later reproduction of the training run uses byte-identical data.

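import hashlib
import time
from pathlib import Path

def snapshot_dataset(dataset_path: str, metadata: dict) -> dict:
    """Hash every file under dataset_path (in sorted path order) into one SHA-256 digest."""
    h = hashlib.sha256()
    for filepath in sorted(Path(dataset_path).rglob("*")):
        if filepath.is_file():
            # Read in 1 MiB chunks so large dataset files are not loaded into memory at once
            with filepath.open("rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
    return {
        "sha256": h.hexdigest(),
        "dataset_path": dataset_path,
        "snapshot_ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "metadata": metadata,
    }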

Store these snapshots in your version-controlled documentation repository alongside the technical documentation.
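A stored snapshot is only useful if it is re-checked, for example before each training run or when responding to an audit. The sketch below assumes the snapshot dictionary has the `sha256` and `dataset_path` keys shown above and that the digest is computed over file bytes in sorted path order; `hash_directory` and `verify_snapshot` are hypothetical helper names.

```python
import hashlib
from pathlib import Path

def hash_directory(dataset_path: str) -> str:
    """SHA-256 over the bytes of every file under dataset_path, in sorted path order."""
    h = hashlib.sha256()
    for filepath in sorted(Path(dataset_path).rglob("*")):
        if filepath.is_file():
            with filepath.open("rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
    return h.hexdigest()

def verify_snapshot(snapshot: dict) -> bool:
    """Return True if the dataset on disk still matches the recorded digest."""
    return hash_directory(snapshot["dataset_path"]) == snapshot["sha256"]
```

Note that a digest over file contents alone does not change when a file is renamed without its sort position changing; if file names carry meaning in your pipeline, consider feeding each relative path into the hash as well.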

Stage 4: Lineage Graph

For complex pipelines with multiple input datasets, a lineage graph makes the relationships queryable and auditable. Tools like DVC (open source, Apache 2.0) or Apache Atlas provide lineage tracking with API access:

# dvc.yaml — lineage declared as pipeline stages
stages:
  ingest_public_corpus:
    cmd: python scripts/ingest.py --source public_corpus_v3
    deps:
      - scripts/ingest.py
    outs:
      - data/raw/public_corpus_v3.parquet

  clean_public_corpus:
    cmd: python scripts/clean.py --input data/raw/public_corpus_v3.parquet
    deps:
      - scripts/clean.py
      - data/raw/public_corpus_v3.parquet
    outs:
      - data/processed/public_corpus_v3_clean.parquet

  assemble_training_set:
    cmd: python scripts/assemble.py --config config/dataset_mix_v2.yaml
    deps:
      - scripts/assemble.py
      - data/processed/
    outs:
      - data/final/training_v7.parquet

DVC tracks the full dependency graph and records content hashes for each stage's dependencies and outputs in dvc.lock, giving you an auditable lineage record without additional infrastructure.
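
To illustrate what "queryable" means in practice, the sketch below walks a stage graph backwards from a final artifact to every stage that contributed to it. It does not use DVC's own API: `STAGES` is a hand-written stand-in for the deps and outs declared above, and `upstream_stages` is a hypothetical helper.

```python
# Hand-copied stand-in for the deps/outs of the pipeline stages; not read from dvc.yaml.
STAGES = {
    "ingest_public_corpus": {
        "deps": ["scripts/ingest.py"],
        "outs": ["data/raw/public_corpus_v3.parquet"],
    },
    "clean_public_corpus": {
        "deps": ["scripts/clean.py", "data/raw/public_corpus_v3.parquet"],
        "outs": ["data/processed/public_corpus_v3_clean.parquet"],
    },
    "assemble_training_set": {
        "deps": ["scripts/assemble.py", "data/processed/"],
        "outs": ["data/final/training_v7.parquet"],
    },
}

def _feeds(out: str, dep: str) -> bool:
    """An output feeds a dependency if the paths match or the dependency is its parent directory."""
    return out == dep or (dep.endswith("/") and out.startswith(dep))

def upstream_stages(artifact: str, stages: dict) -> list[str]:
    """Return the stages that contribute to `artifact`, nearest producer first."""
    ordered, queue = [], [artifact]
    while queue:
        target = queue.pop(0)
        for name, stage in stages.items():
            if name not in ordered and any(_feeds(out, target) for out in stage["outs"]):
                ordered.append(name)
                queue.extend(stage["deps"])  # keep walking towards the raw sources
    return ordered

lineage = upstream_stages("data/final/training_v7.parquet", STAGES)
print(lineage)
```

The same traversal answers the audit question in reverse order: starting from the training set, it lists each transformation back to the original ingestion step.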

Connecting Provenance to Other Art.10 Obligations

Data provenance records are not standalone — they feed directly into the other Art.10 compliance obligations covered in this series:

Art.10(3), errors and completeness: Provenance logs demonstrate what cleaning and error-correction steps were applied. Without this record, the claim that data is, to the best extent possible, free of errors and complete cannot be substantiated.

Art.10(2)(f) and (g), bias examination and mitigation: You cannot examine training data for biases without knowing its origin and demographic composition. Provenance records are the prerequisite for the bias auditing process covered in Post #2 of this series.

Art.12, record-keeping and logging: Art.12 requires high-risk AI systems to support automatic logging of events over their lifetime. Data provenance records complement those runtime logs by covering the pre-training phase that Art.12 does not reach.

Annex IV, technical documentation: Point 2(d) of Annex IV requires the technical documentation to describe the training data sets used, including their provenance, scope and main characteristics, how the data was obtained and selected, and the labelling and cleaning procedures applied. Provenance records are the source material for this section.

Third-Party Training Data

If your training data comes from a third-party supplier — a licensed corpus, a data broker, a pre-trained model — Art.10 still applies. You are responsible for:

Documenting the origin and collection methodology as reported by the supplier (obtain this in writing)
Verifying that the supplier's collection practices align with applicable GDPR obligations if personal data is involved
Recording what validation steps you applied to the received data before incorporating it into your pipeline

Contractual data governance clauses with training data suppliers are advisable: require that suppliers maintain Art.10-compatible provenance records and provide access to them in the event of an audit.

For pre-trained models used as a base for fine-tuning, document the model card's stated training data provenance and note any gaps between what is documented and what Art.10 would require if you had collected that data yourself.

Records Retention

Article 10 does not specify a retention period for provenance records. However, Article 18 requires providers to keep the Annex IV technical documentation at the disposal of national competent authorities for 10 years after the high-risk AI system has been placed on the market or put into service. Provenance records form part of that documentation and should follow the same retention schedule.

Practical implication: if your training dataset was assembled in 2025 for a system placed on the market in 2026, those provenance records must be retained until at least 2036.
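
The retention arithmetic can be made explicit in a records policy. This is a small sketch assuming the 10-year period described above; `retention_end` is a hypothetical helper and the placing date is an invented example.

```python
from datetime import date

RETENTION_YEARS = 10  # period described in this section

def retention_end(placed_on_market: date, years: int = RETENTION_YEARS) -> date:
    """Earliest date on which the retention period has elapsed."""
    try:
        return placed_on_market.replace(year=placed_on_market.year + years)
    except ValueError:
        # 29 February has no counterpart in a non-leap year; round up to 1 March
        # so the computed period is never shorter than the required one.
        return date(placed_on_market.year + years, 3, 1)

# Invented placing date for illustration.
print(retention_end(date(2026, 8, 2)))
```

If the system is placed on the market again in a later version, the clock should be computed from the most recent placing date that relied on the dataset.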

Checklist: Art.10 Data Provenance Compliance

Dataset registry covers every training, validation, and testing dataset component
Each registry entry includes source type, origin URL or supplier, collection dates, record count, and legal basis
Preparation operations log covers every transformation step with operator, date, parameters, input/output record counts
Annotation documentation includes guidelines version, annotator group identifiers, IAA scores, and adjudication process
Aggregation and sampling records document mixing ratios and split methodology
Cryptographic snapshot of finalized training dataset is stored with technical documentation
Lineage graph or equivalent representation links all pipeline stages
Third-party data sources have supplier-provided provenance documentation
Records retention policy aligned with 10-year Annex IV obligation
Provenance records reviewed and updated whenever training data is refreshed

What Comes Next in This Series

Post #4 covers how to build automated data governance CI/CD gates — checks that run on every training data refresh to verify that provenance records are complete and the dataset still meets Art.10(3) requirements before a new training run can begin.

Post #5 will assemble the complete Art.10 compliance checklist and training data documentation package for Annex IV submission.

The August 2, 2026 deadline applies to all providers who have placed or are placing high-risk AI systems on the EU market. Data governance documentation must be complete at the time of placing the system on the market — not at the time of first audit.

This post is part of the sota.io EU AI Act Data Governance Sprint series covering Article 10 compliance for high-risk AI providers. Read Post #1: Training Data Documentation Requirements | Read Post #2: Dataset Diversity and Bias Testing
