EU AI Act Art.10 Data Provenance Logging: Tracking Training Data Origin, Modifications, and Governance Records
Post #3 in the sota.io EU AI Act Data Governance Sprint Series
When an EU AI Act market surveillance authority audits your high-risk AI system, one of the first questions is: where did your training data come from, and can you prove how it was modified before training?
Article 10 of Regulation (EU) 2024/1689 does not treat training data as a black box. It requires providers to document the entire lifecycle of training, validation, and testing data — from original source through every transformation to final dataset composition. This is data provenance, and the EU AI Act makes it a legal obligation for high-risk AI systems.
This post covers what Art.10 requires for data provenance and origin documentation, what records must be maintained, and how to build a provenance-logging pipeline that satisfies both the regulation and real-world audit requirements.
What Art.10 Actually Requires for Data Provenance
Article 10(2) of Regulation (EU) 2024/1689 specifies the data governance practices that must cover training, validation, and testing data. Two sub-provisions directly govern provenance:
Art.10(2)(b) — Data collection and origin:
"data collection processes and the origin of data, and in the case of personal data, the original purpose of the data collection"
This requires documentation of how data was collected and where it came from. For each training dataset component, providers must record:
- The original source (public dataset, licensed corpus, proprietary collection, synthetic generation, third-party supplier)
- The collection methodology (web scraping, survey, sensor recording, labelling contract, API pull)
- The date range of collection
- Any licensing, consent, or legal basis under which the data was obtained and, for personal data, the original purpose of the collection
- Whether the source has changed or been updated since initial collection
Art.10(2)(c) — Data preparation operations:
"relevant data-preparation processing operations, such as annotation, labelling, cleaning, updating, enrichment and aggregation"
Every transformation applied to raw data before it enters the training pipeline must be documented. This includes:
- Annotation and labelling: who labelled the data, what annotation guidelines were used, inter-annotator agreement metrics
- Cleaning: which records were removed and why (duplicates, out-of-distribution, corrupted, offensive content)
- Enrichment: what features were added from secondary sources
- Aggregation: how multiple datasets were combined, what weighting or sampling was applied
- Filtering: what criteria excluded certain records from the final training split
Together, Art.10(2)(b) and Art.10(2)(c) create a requirement for end-to-end data lineage — a traceable record from raw source to finalized training split.
Why "Data Provenance" Is More Than a Spreadsheet
Many teams treat data documentation as a one-page description filed with the technical documentation. This misunderstands what Art.10 requires.
The regulation ties provenance to compliance outcomes. Article 10(2)(f) requires examining data for possible biases, and that examination depends on knowing what the data contains and where it came from. Article 10(3) requires datasets to be, to the best extent possible, free of errors and complete, which cannot be demonstrated without records of the cleaning steps applied. Article 10(4) requires datasets to take into account the geographical, contextual, behavioural, or functional setting in which the system will be used, which in turn requires knowing where training samples originated.
Provenance records serve three concrete compliance functions:
- Audit response: When a national market surveillance authority requests evidence of data governance, provenance logs are the primary artifact. A retrospective "we believe we used dataset X" is not sufficient.
- Bias root-cause analysis: If a post-deployment bias incident occurs, provenance records allow you to trace whether the problem originated in the collection process, the labelling instructions, or the aggregation step.
- Dataset update governance: When training data is refreshed, provenance records for the previous version establish the baseline for change impact assessment.
What Records Must Be Maintained
Article 10 does not prescribe a specific format, but the records must support the technical documentation requirements of Annex IV. Based on the obligation structure in Art.10(2), compliant provenance records cover the following:
Dataset Registry
A master registry listing every dataset component used in training, validation, and testing, with:
| Field | Description |
|---|---|
| Dataset ID | Unique identifier for internal reference |
| Name / version | Human-readable label and version string |
| Source type | Public / licensed / proprietary / synthetic / third-party |
| Source URL or supplier | Reference to the original location or provider contract |
| Collection date range | First and last date of data collection |
| Record count | Number of records included from this source |
| License / legal basis | GDPR legal basis if personal data; licensing terms otherwise |
| Last reviewed | Date of most recent provenance review |
Preparation Operations Log
For each dataset component, a chronological log of every preparation step:
| Step | Operation | Operator | Date | Input count | Output count | Parameters |
|---|---|---|---|---|---|---|
| 1 | Deduplication | ETL pipeline v2.3 | 2025-11-01 | 2,450,000 | 2,310,000 | Jaccard similarity > 0.95 |
| 2 | Language filtering | langdetect v1.0.9 | 2025-11-02 | 2,310,000 | 2,198,000 | EN/DE/FR only |
| 3 | PII redaction | presidio v2.2.0 | 2025-11-03 | 2,198,000 | 2,198,000 | PERSON, EMAIL, PHONE |
| 4 | Labelling (Round 1) | Contractor team A | 2025-11-05 | 50,000 sample | 50,000 | Guidelines v1.4 |
| 5 | Label QA | Internal review | 2025-11-12 | 50,000 | 49,200 | IAA threshold 0.82 |
Annotation Documentation
For supervised learning tasks, annotation records must include:
- Annotation guidelines version and effective date
- Annotator identities or anonymized contractor group identifiers
- Inter-annotator agreement (IAA) scores per label category
- Adjudication process for disagreements
- Samples excluded due to low-confidence annotations
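As one way to produce the per-category IAA scores listed above, the following minimal sketch computes Cohen's kappa for two annotators, both overall and one-vs-rest for each label. The choice of Cohen's kappa and the function names are assumptions made for illustration; the regulation does not prescribe an agreement metric, and teams with more than two annotators would typically use a different statistic.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies, summed over labels.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

def kappa_per_category(labels_a, labels_b):
    """One-vs-rest kappa for each label category seen by either annotator."""
    categories = sorted(set(labels_a) | set(labels_b))
    return {
        c: cohens_kappa([x == c for x in labels_a], [x == c for x in labels_b])
        for c in categories
    }
```

Storing the resulting dictionary next to the guidelines version makes it possible to show later which categories fell below the agreement threshold used in label QA.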
Aggregation and Sampling Records
When multiple dataset components are combined:
- The mixing ratio applied (e.g., 60% corpus A, 30% corpus B, 10% synthetic)
- Rationale for the chosen ratio
- Any upsampling or downsampling applied to balance demographic groups
- Train/validation/test split methodology and seed values
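The split methodology and seed can be captured at the moment the split is made. The sketch below is one possible approach, assuming records have sortable IDs; the function name, default ratios, and record layout are illustrative rather than prescribed.

```python
import random

def split_with_record(record_ids, ratios=(0.8, 0.1, 0.1), seed=42):
    """Deterministically split IDs into train/validation/test and return
    both the splits and a small audit record describing how they were made."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    ids = sorted(record_ids)           # canonical order, so the seed alone fixes the result
    random.Random(seed).shuffle(ids)   # private RNG instance; global random state is untouched
    n_train = int(len(ids) * ratios[0])
    n_val = int(len(ids) * ratios[1])
    splits = {
        "train": ids[:n_train],
        "validation": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],  # the test split absorbs any rounding remainder
    }
    record = {
        "method": "seeded shuffle (random.Random)",
        "seed": seed,
        "ratios": dict(zip(("train", "validation", "test"), ratios)),
        "counts": {name: len(part) for name, part in splits.items()},
    }
    return splits, record
```

The returned record can be written alongside the mixing ratios. Recording the Python version as well is prudent, since shuffle output is only guaranteed to be reproducible within a given implementation.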
Building a Provenance-Logging Pipeline
Most high-risk AI teams will need to integrate provenance logging into an existing data engineering workflow. The key principle is that provenance should be generated automatically as data flows through the pipeline, not assembled manually after training completes.
Stage 1: Source Registration
At the point of data ingestion, every new dataset source triggers the creation of a provenance record. This can be implemented as a registration step in your ETL or data warehouse layer:
import json
from dataclasses import dataclass, asdict
from pathlib import Path

@dataclass
class DatasetSource:
    dataset_id: str
    name: str
    version: str
    source_type: str            # "public" | "licensed" | "proprietary" | "synthetic" | "third-party"
    source_url: str
    collection_date_start: str  # ISO 8601 date string
    collection_date_end: str    # ISO 8601 date string
    license: str
    legal_basis: str            # GDPR Art.6 basis if personal data
    record_count: int

def register_source(source: DatasetSource, registry_path: str) -> None:
    """Write one JSON provenance record per dataset into the registry directory."""
    registry = Path(registry_path)
    registry.mkdir(parents=True, exist_ok=True)  # create the registry on first use
    with open(registry / f"{source.dataset_id}.json", "w") as f:
        json.dump(asdict(source), f, indent=2)
Stage 2: Operation Logging
Each preparation step emits a provenance event. A lightweight event log stores the operation type, parameters, and record counts before and after:
import json
import time

def log_operation(dataset_id: str, operation: str, operator: str,
                  params: dict, input_count: int, output_count: int,
                  log_path: str) -> None:
    """Append one provenance event to a JSON Lines log (one JSON object per line)."""
    event = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),  # UTC
        "dataset_id": dataset_id,
        "operation": operation,
        "operator": operator,
        "params": params,
        "input_count": input_count,
        "output_count": output_count,
        "records_removed": input_count - output_count,
    }
    # Append mode keeps earlier events intact, so the log stays chronological.
    with open(log_path, "a") as f:
        f.write(json.dumps(event) + "\n")
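Once events are logged this way, the log itself can be checked for undocumented steps. The sketch below assumes the JSON Lines layout written by log_operation above; load_events and find_count_gaps are hypothetical helper names. It flags every step whose input count does not match the previous step's output count.

```python
import json

def load_events(log_path, dataset_id):
    """Read the JSON Lines log and keep the events for one dataset, in file order."""
    with open(log_path) as f:
        events = [json.loads(line) for line in f if line.strip()]
    return [e for e in events if e["dataset_id"] == dataset_id]

def find_count_gaps(events):
    """Return (previous_operation, next_operation, unaccounted_records) for each
    step whose input_count differs from the previous step's output_count."""
    gaps = []
    for prev, curr in zip(events, events[1:]):
        delta = prev["output_count"] - curr["input_count"]
        if delta != 0:
            gaps.append((prev["operation"], curr["operation"], delta))
    return gaps
```

Applied to the example log in the table earlier, this check would flag the transition from PII redaction (2,198,000 records out) to labelling (50,000 records in), a hint that the sampling step deserves its own log entry.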
Stage 3: Dataset Snapshot Hashing
When a dataset version is finalized for training, generate a cryptographic hash of the dataset contents. This serves two purposes: it gives you evidence that the training dataset was not modified after the documentation was created, and it lets you confirm that a later reproduction of the training run uses byte-identical data.
import hashlib
import time
from pathlib import Path

def snapshot_dataset(dataset_path: str, metadata: dict) -> dict:
    """Hash every file under dataset_path and return a snapshot record."""
    h = hashlib.sha256()
    # Sort paths so the digest does not depend on filesystem listing order.
    for filepath in sorted(Path(dataset_path).rglob("*")):
        if filepath.is_file():
            h.update(filepath.read_bytes())
    snapshot = {
        "sha256": h.hexdigest(),
        "dataset_path": dataset_path,
        "snapshot_ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),  # UTC
        "metadata": metadata,
    }
    return snapshot
Store these snapshots in your version-controlled documentation repository alongside the technical documentation.
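A stored snapshot is only useful if it can be re-checked later. The following sketch recomputes the digest with the same scheme as snapshot_dataset above (file contents only, in sorted path order) and compares it with the recorded value. hash_dataset and verify_snapshot are hypothetical helper names, and reading in chunks is an added assumption to keep memory use flat on large files; it does not change the resulting digest.

```python
import hashlib
from pathlib import Path

def hash_dataset(dataset_path, chunk_size=1 << 20):
    """SHA-256 over the contents of all files under dataset_path, in sorted path order."""
    h = hashlib.sha256()
    for filepath in sorted(Path(dataset_path).rglob("*")):
        if filepath.is_file():
            with filepath.open("rb") as f:
                # Read in 1 MiB chunks instead of loading whole files into memory.
                for chunk in iter(lambda: f.read(chunk_size), b""):
                    h.update(chunk)
    return h.hexdigest()

def verify_snapshot(dataset_path, snapshot):
    """True if the dataset on disk still matches the recorded snapshot digest."""
    return hash_dataset(dataset_path) == snapshot["sha256"]
```

Because only file contents are hashed, renaming a file without changing the sort order goes undetected. Teams that care about this could also feed each relative path into the digest, at the cost of changing the hashing scheme.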
Stage 4: Lineage Graph
For complex pipelines with multiple input datasets, a lineage graph makes the relationships queryable and auditable. Tools like DVC (open source, Apache 2.0) or Apache Atlas provide lineage tracking with API access:
# dvc.yaml: lineage declared as pipeline stages
stages:
  ingest_public_corpus:
    cmd: python scripts/ingest.py --source public_corpus_v3
    deps:
      - scripts/ingest.py
    outs:
      - data/raw/public_corpus_v3.parquet
  clean_public_corpus:
    cmd: python scripts/clean.py --input data/raw/public_corpus_v3.parquet
    deps:
      - scripts/clean.py
      - data/raw/public_corpus_v3.parquet
    outs:
      - data/processed/public_corpus_v3_clean.parquet
  assemble_training_set:
    cmd: python scripts/assemble.py --config config/dataset_mix_v2.yaml
    deps:
      - scripts/assemble.py
      - data/processed/
    outs:
      - data/final/training_v7.parquet
DVC tracks the full dependency graph and stores checksums at each stage, giving you an auditable lineage record without additional infrastructure.
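To illustrate what "queryable" means here without depending on any tool's API, the sketch below copies the stage dependencies and outputs from the pipeline above into a plain dictionary by hand and walks the graph upstream from a given output. The STAGES dictionary and upstream_stages function are illustrative assumptions and are not part of DVC.

```python
# Dependencies and outputs copied by hand from the pipeline definition above.
STAGES = {
    "ingest_public_corpus": {
        "deps": ["scripts/ingest.py"],
        "outs": ["data/raw/public_corpus_v3.parquet"],
    },
    "clean_public_corpus": {
        "deps": ["scripts/clean.py", "data/raw/public_corpus_v3.parquet"],
        "outs": ["data/processed/public_corpus_v3_clean.parquet"],
    },
    "assemble_training_set": {
        "deps": ["scripts/assemble.py", "data/processed/"],
        "outs": ["data/final/training_v7.parquet"],
    },
}

def _produces(out_path, wanted):
    """True if out_path is the wanted path or lies inside the wanted directory."""
    return out_path == wanted or (wanted.endswith("/") and out_path.startswith(wanted))

def upstream_stages(stages, target):
    """Names of every stage the target path transitively depends on."""
    seen = set()
    pending = [target]
    while pending:
        path = pending.pop()
        for name, stage in stages.items():
            if name not in seen and any(_produces(out, path) for out in stage["outs"]):
                seen.add(name)
                pending.extend(stage["deps"])  # keep walking towards the raw sources
    return seen
```

Calling upstream_stages(STAGES, "data/final/training_v7.parquet") returns all three stage names, which is the kind of answer an auditor's "what fed into this training set?" question calls for.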
Connecting Provenance to Other Art.10 Obligations
Data provenance records are not standalone — they feed directly into the other Art.10 compliance obligations covered in this series:
Art.10(3), errors and completeness: Provenance logs demonstrate what cleaning and error-correction steps were applied. Without this record, the claim that data is, to the best extent possible, "free of errors" cannot be substantiated.
Art.10(2)(f) and (g), bias examination and mitigation: You cannot examine training data for biases without knowing its origin and demographic composition. Provenance records are the prerequisite for the bias auditing process covered in Post #2 of this series. (Art.10(5) is the separate provision that permits processing special categories of personal data, under safeguards, where strictly necessary for bias detection and correction.)
Art.12, record-keeping: Art.12 requires high-risk AI systems to automatically record events (logs) over their lifetime. Data provenance records complement those runtime logs by covering the pre-training phase that Art.12 does not reach.
Annex IV, technical documentation: Point 2(d) of Annex IV requires the technical documentation to describe the training data sets used, including their provenance, scope and main characteristics, how the data was obtained and selected, labelling procedures, and data cleaning methodologies. Provenance records are the source material for this section.
Third-Party Training Data
If your training data comes from a third-party supplier — a licensed corpus, a data broker, a pre-trained model — Art.10 still applies. You are responsible for:
- Documenting the origin and collection methodology as reported by the supplier (obtain this in writing)
- Verifying that the supplier's collection practices align with applicable GDPR obligations if personal data is involved
- Recording what validation steps you applied to the received data before incorporating it into your pipeline
Contractual data governance clauses with training data suppliers are advisable: require that suppliers maintain Art.10-compatible provenance records and provide access to them in the event of an audit.
For pre-trained models used as a base for fine-tuning, document the model card's stated training data provenance and note any gaps between what is documented and what Art.10 would require if you had collected that data yourself.
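Noting the gaps can be made systematic. The sketch below compares whatever a supplier or model card documents against a list of fields you expect; the field names in REQUIRED_FIELDS are hypothetical and loosely follow the origin and preparation items discussed earlier in this post, so adapt them to your own registry schema.

```python
# Hypothetical field names; align these with your own dataset registry.
REQUIRED_FIELDS = (
    "source",
    "collection_method",
    "collection_date_range",
    "legal_basis",
    "preparation_steps",
)

def provenance_gaps(supplier_doc):
    """Return the required fields that are missing or empty in the supplier's documentation."""
    return [field for field in REQUIRED_FIELDS if not supplier_doc.get(field)]
```

The resulting list can be filed with the dataset registry entry, so that known gaps are documented rather than discovered during an audit.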
Records Retention
Article 10 does not specify a retention period for provenance records. However, Art.18 requires providers to keep the Annex IV technical documentation at the disposal of national competent authorities for 10 years after the high-risk AI system has been placed on the market or put into service. Provenance records are part of this technical documentation and should follow the same retention schedule.
Practical implication: if your training dataset was assembled in 2025 for a system placed on the market in 2026, those provenance records must be retained until at least 2036.
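The date arithmetic is simple enough to automate in a retention policy. This is a minimal sketch, assuming the clock starts on the date the system is placed on the market; retention_end is a hypothetical helper name, and the date of putting into service may be the relevant trigger instead.

```python
from datetime import date

def retention_end(placed_on_market: date, years: int = 10) -> date:
    """Earliest date on which provenance records may be disposed of."""
    try:
        return placed_on_market.replace(year=placed_on_market.year + years)
    except ValueError:
        # 29 February has no counterpart in a non-leap target year; fall back to 28 February.
        return placed_on_market.replace(year=placed_on_market.year + years, day=28)
```

For a system placed on the market on 2 August 2026, this returns 2 August 2036, matching the example above.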
Checklist: Art.10 Data Provenance Compliance
- Dataset registry covers every training, validation, and testing dataset component
- Each registry entry includes source type, origin URL or supplier, collection dates, record count, and legal basis
- Preparation operations log covers every transformation step with operator, date, parameters, input/output record counts
- Annotation documentation includes guidelines version, annotator group identifiers, IAA scores, and adjudication process
- Aggregation and sampling records document mixing ratios and split methodology
- Cryptographic snapshot of finalized training dataset is stored with technical documentation
- Lineage graph or equivalent representation links all pipeline stages
- Third-party data sources have supplier-provided provenance documentation
- Records retention policy aligned with the 10-year retention period for Annex IV technical documentation
- Provenance records reviewed and updated whenever training data is refreshed
What Comes Next in This Series
Post #4 covers how to build automated data governance CI/CD gates — checks that run on every training data refresh to verify that provenance records are complete and the dataset still meets Art.10(3) requirements before a new training run can begin.
Post #5 will assemble the complete Art.10 compliance checklist and training data documentation package for Annex IV submission.
The August 2, 2026 application date covers providers placing high-risk AI systems listed in Annex III on the EU market; systems already on the market before that date fall in scope once their design changes significantly (Art.111(2)). Data governance documentation must be complete at the time of placing the system on the market, not at the time of first audit.
This post is part of the sota.io EU AI Act Data Governance Sprint series covering Article 10 compliance for high-risk AI providers. Read Post #1: Training Data Documentation Requirements | Read Post #2: Dataset Diversity and Bias Testing