Executive summary
- What: Evidence-grade annotation is the category of training-data production where every artefact a regulator, Notified Body, or compliance auditor will ask for is captured at the moment of labelling and exported as a per-dataset evidence bundle alongside the labels themselves.
- What it isn't: "High-quality annotation." High quality is a property of the labels; evidence-grade is a property of the labelling process. Quality and defensibility are independent axes.
- Who's asking for it: EU AI Act Article 11 (Annex IV), the HIPAA Security Rule audit controls at 45 CFR 164.312(b), SOC 2 Trust Services Criteria CC8 (change management), ISO/IEC 42001:2023 Annex A.7 (data for AI systems), and every Notified Body conducting a 2026 conformity assessment.
- The category-defining insight: Three of the six evidence artefacts decay irrecoverably the moment the annotation tool moves to the next batch. Reconstruction is not documentation; it's forensic re-annotation.
This article does two things. First, it defines "evidence-grade annotation" as a category — the term LabelFort operates under and the standard we believe regulated-AI buyers should ask every vendor to meet by name. Second, it lists the six artefacts that compose it, the four regulatory regimes that ask for them, and how to test whether a vendor's "evidence-grade" claim is real or marketing.
What is evidence-grade annotation?
Evidence-grade annotation is data annotation where every artefact required to defend the dataset under audit is produced at the moment of labelling, versioned, and exportable as a per-dataset evidence bundle alongside the labels themselves.
That sentence is the whole definition. Each clause does specific work:
- "every artefact required to defend the dataset under audit" — not "some" artefacts. The set is determined by what regulators and Notified Bodies ask for, not by what the vendor finds convenient to log.
- "produced at the moment of labelling" — not retrospectively. Three of the artefacts decay irrecoverably once the annotation tool moves on.
- "versioned" — every artefact has an immutable timestamp and version hash. You need the version that was applicable to the specific records being audited.
- "exportable as a per-dataset evidence bundle" — a single bundle, on demand, in a documented format. Not "we can assemble it from logs if you give us two weeks."
Why "high-quality annotation" is the wrong frame
For most of the 2018–2023 era of commercial data annotation, "quality" meant label accuracy and consistency. These remain important — they describe whether the labels are correct.
The regulatory frame introduced in 2024–2026 asks a different question: can you prove how the labels were produced, by whom, under which version of which guideline, and against which audit-trail control?
A dataset can be labelled correctly and still fail Annex IV Section 2 if the labelling procedure cannot be evidenced. Quality and defensibility are independent.
The six evidence artefacts
Six artefacts compose evidence-grade annotation. Each one maps to specific clauses in the regulatory regimes it serves.
1. Versioned annotator guideline
The exact instruction set used to label each record, with a version hash. The audit will ask which guideline was applicable to record 14,823. The answer is in the version history, not in someone's memory. Maps to: EU AI Act Annex IV Section 2; ISO/IEC 42001 Annex A.7.
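A minimal sketch of how a guideline version hash could be stamped onto each record at labelling time. It assumes the guideline is available as bytes and that SHA-256 is an acceptable digest. The function names and the sample guideline text are hypothetical, chosen for illustration only.

```python
import hashlib


def guideline_hash(guideline_bytes: bytes) -> str:
    """Return a SHA-256 hex digest identifying one exact guideline version."""
    return hashlib.sha256(guideline_bytes).hexdigest()


def stamp_record(record: dict, guideline_bytes: bytes) -> dict:
    """Return a copy of the record carrying the hash of the guideline in force."""
    stamped = dict(record)
    stamped["guideline_version_hash"] = guideline_hash(guideline_bytes)
    return stamped


# Hypothetical guideline texts standing in for two consecutive versions.
v1_3 = b"Grade DR severity on the 5-point scale."
v1_4 = b"Grade DR severity on the 5-point scale. Flag ungradable images."

record = stamp_record({"record_id": 14823, "label": "moderate"}, v1_4)

# Any edit to the guideline text yields a different hash, so a record can
# only match the version that was actually in force when it was labelled.
assert record["guideline_version_hash"] == guideline_hash(v1_4)
assert record["guideline_version_hash"] != guideline_hash(v1_3)
```

Storing the hash on the record, rather than a human-readable version number alone, means the answer to "which guideline applied to record 14,823" can be checked against the archived guideline file itself.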
2. Per-record annotator and adjudicator identity, with credentials
For every record: who labelled it, when, with what qualification (board-certified pathologist, JD, native speaker), and, if reviewed, who reviewed it. Captured at the moment of label, with cryptographic timestamps. Maps to: HIPAA audit controls at 45 CFR 164.312(b); EU AI Act Annex IV Section 2.
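One way to make per-record identity capture tamper-evident is a hash-chained event log, sketched below. This is an illustration under stated assumptions, not a description of any particular tool: the `LabelEvent` fields, the identifiers, and the chaining scheme are hypothetical, and a hash chain is a simpler stand-in for a trusted timestamping service.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class LabelEvent:
    record_id: int
    label: str
    annotator_id: str
    credential: str
    reviewer_id: Optional[str]
    labelled_at: str  # ISO 8601 timestamp, UTC
    prev_hash: str    # hash of the previous event in the log

    def event_hash(self) -> str:
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


GENESIS = "0" * 64  # placeholder hash for the first event


def append_event(log: list, **fields) -> LabelEvent:
    """Append an event that commits to the hash of the one before it."""
    prev = log[-1].event_hash() if log else GENESIS
    event = LabelEvent(prev_hash=prev, **fields)
    log.append(event)
    return event


def verify_chain(log: list) -> bool:
    """Return True only if every event points at its true predecessor."""
    expected = GENESIS
    for event in log:
        if event.prev_hash != expected:
            return False
        expected = event.event_hash()
    return True


log = []
now = datetime.now(timezone.utc).isoformat()
append_event(log, record_id=14823, label="moderate", annotator_id="ann-03",
             credential="board-certified ophthalmologist",
             reviewer_id="ann-01", labelled_at=now)
append_event(log, record_id=14824, label="none", annotator_id="ann-05",
             credential="board-certified ophthalmologist",
             reviewer_id=None, labelled_at=now)

intact = verify_chain(log)

# Rewriting who labelled the first record breaks the link to the second.
tampered = [replace(log[0], annotator_id="ann-99")] + log[1:]
tamper_detected = not verify_chain(tampered)
```

Because identity is a field on every event rather than an admin export, a later edit to any annotator or reviewer ID is detectable from the log alone.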
3. Cohort-level inter-rater reliability
Cohen's κ or Krippendorff's α, broken down by the cohorts that Annex IV Sections 3 and 4 will require evidence for: patient demographic, scanner manufacturer, geography, language. A single project-level IRR figure is not enough, because strong agreement overall can mask weak agreement in one cohort. Maps to: EU AI Act Annex IV Section 4; ISO/IEC 42001 Annex A.7.
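The difference between project-level and cohort-level agreement can be shown with a small two-rater sketch. The cohort names and labels below are invented toy data, and the helper names are hypothetical. With more than two raters, a multi-rater statistic such as Krippendorff's α would be the more appropriate choice.

```python
from collections import Counter, defaultdict


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each rater's marginal rate per category.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # convention: both raters used one identical category
    return (observed - expected) / (1.0 - expected)


def kappa_by_cohort(rows):
    """rows: iterable of (cohort, label_from_rater_a, label_from_rater_b)."""
    grouped = defaultdict(lambda: ([], []))
    for cohort, a, b in rows:
        grouped[cohort][0].append(a)
        grouped[cohort][1].append(b)
    return {cohort: cohen_kappa(a, b) for cohort, (a, b) in grouped.items()}


# Toy data: perfect agreement on adults, one disagreement on geriatric cases.
rows = [
    ("adult", "pos", "pos"), ("adult", "pos", "pos"),
    ("adult", "neg", "neg"), ("adult", "neg", "neg"),
    ("geriatric", "pos", "pos"), ("geriatric", "pos", "neg"),
    ("geriatric", "neg", "neg"), ("geriatric", "neg", "neg"),
]

per_cohort = kappa_by_cohort(rows)
pooled = cohen_kappa([a for _, a, _ in rows], [b for _, _, b in rows])

print(per_cohort)  # {'adult': 1.0, 'geriatric': 0.5}
print(pooled)      # 0.75
```

The pooled figure of 0.75 looks acceptable on its own, while the geriatric cohort sits at 0.5. That gap is what a project-level number hides.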
4. Dataset provenance log, including cross-border transfer
Where the data came from (source, licence, lawful basis), how it was selected, how it crossed any borders to reach the annotators, and a deduplication record. Maps to: EU AI Act Annex IV Section 2; GDPR Article 28 and Chapter V; DPDP Sections 8–10; SOC 2 CC9 and C1.
5. Data cleaning code with commit hash, plus sample log
The outlier-detection logic, de-duplication logic, and missing-value handling, applied as code rather than "common sense." The commit hash of the cleaning script. A sample log of cleaned versus uncleaned records. Maps to: EU AI Act Annex IV Section 2; ISO/IEC 42001 Annex A.7.
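As a rough sketch, cleaning "applied as code" might look like the function below, which records a decision for every input record and tags each decision with the commit of the cleaning script. The field names (`image_sha256`, `quality_score`), the two rules, and the sample records are assumptions made for illustration. The commit hash is passed in as a parameter rather than read from a repository.

```python
import json


def clean(records, commit_hash):
    """Drop exact duplicates and records with no quality score.

    Returns the kept records plus a log entry for every input record.
    """
    seen_images = set()
    kept, log = [], []
    for rec in records:
        if rec.get("quality_score") is None:
            action = "dropped_missing_quality_score"
        elif rec["image_sha256"] in seen_images:
            action = "dropped_duplicate"
        else:
            action = "kept"
            seen_images.add(rec["image_sha256"])
            kept.append(rec)
        log.append({
            "record_id": rec["record_id"],
            "action": action,
            "cleaning_commit": commit_hash,
        })
    return kept, log


# Hypothetical input records.
records = [
    {"record_id": 1, "image_sha256": "aa", "quality_score": 0.9},
    {"record_id": 2, "image_sha256": "aa", "quality_score": 0.8},
    {"record_id": 3, "image_sha256": "bb", "quality_score": 0.7},
    {"record_id": 4, "image_sha256": "cc", "quality_score": None},
]

kept, log = clean(records, commit_hash="a4b2f9c")

# One JSON object per line, suitable for a cleaning_sample.jsonl file.
sample_jsonl = "\n".join(json.dumps(entry, sort_keys=True) for entry in log)
print(sample_jsonl)
```

Logging the dropped records as well as the kept ones is the design choice that matters here: the sample log then shows what the cleaning step removed and why, not just what survived.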
6. Datasheet, Gebru-pattern, auto-populated
A single per-dataset document following the Gebru et al. (2018) "Datasheets for Datasets" framework: 41 questions across motivation, composition, collection, preprocessing, uses, distribution, maintenance. Auto-populated from the underlying log rather than written from memory. Maps to: EU AI Act Annex IV Section 2; ISO/IEC 42001 Annex A.7; NIST AI RMF.
When all six are captured at annotation time and exported together, the dataset is evidence-grade. When any one is missing, remediating the gap can cost more than the labels did to produce.
The four regulatory regimes that ask for evidence-grade
Each regime asks for a different subset of the six, but the union of their demands is exactly the six. Build for the union and you pass each individual audit.
| Regime | What it asks for | Effective date |
|---|---|---|
| EU AI Act Annex IV (Article 11) | All six. Section 2 calls out labelling procedures, datasheets, cleaning methodologies, provenance. Section 4 calls out cohort-level performance evidence. | 2 August 2026 |
| HIPAA 45 CFR 164.312(b) and 164.502(b) | Audit controls (#2), provenance (#4), minimum-necessary access. Plus role separation enforced inside the annotation tool. | In force |
| SOC 2 Trust Services Criteria | CC8 (change management) requires evidence of dataset changes; CC9 (risk mitigation) requires the provenance log; C1 (confidentiality) requires the transfer log. | In force |
| ISO/IEC 42001:2023 Annex A.7 | Data for AI systems. Controls A.7.4 (quality of data), A.7.5 (data provenance), and A.7.6 (data preparation) are evidenced by documented labelling procedures, IAA records, datasheet, provenance, cleaning. | Published December 2023 |
Evidence-grade annotation and AI compliance: how the standard maps to your audit
If your organisation already runs AI compliance programmes — EU AI Act readiness, HIPAA audit-trail reviews, SOC 2 change-management testing, ISO/IEC 42001 assessments — the evidence-grade standard tells you exactly what to ask your annotation vendor for. Each of the six artefacts answers a specific audit question: which guideline version applied to this record, who labelled it and with what credentials, how reliable labels are by cohort, where the data came from and how it crossed borders, what cleaning logic was applied, and what the dataset documentation says. When those six are captured at labelling time and exported as one bundle, your AI compliance review moves from reconstruction to verification.
How evidence-grade differs from adjacent concepts
"Audit-ready annotation" — broader and softer. Often means the vendor can produce some evidence on request, after the fact. Evidence-grade means the evidence is produced at the moment and exported by default.
"ISO/IEC 42001-aligned annotation" — a management-system claim. A 42001-certified vendor has documented procedures for managing AI risk; the certificate does not guarantee that those procedures are operationalised inside the annotation tool.
"Compliance-first annotation" — a positioning claim. The test is the same: can the vendor demonstrate a per-dataset evidence bundle from a previous client, with all six artefacts present?
The academic and industry lineage
Evidence-grade isn't a phrase invented in a vacuum. It sits at the intersection of three strands of work that have been building since 2018:
- Datasheets for Datasets (Gebru et al., 2018). The first widely cited framework for structured dataset documentation. 41 questions, used today as the de facto standard for the Annex IV datasheet requirement.
- Model Cards for Model Reporting (Mitchell et al., 2019). The companion framework for documenting trained models. The EU AI Act does not name model cards, but the technical-documentation duties it places on general-purpose AI providers under Articles 53–55 cover similar ground.
- NIST AI Risk Management Framework (NIST AI 100-1, 2023). The Govern function explicitly requires documented training-data lineage as a precondition to claims about model behaviour.
A worked example — an evidence-grade bundle for a 47,000-image retinal dataset
For a high-risk diabetic-retinopathy screening AI going to market in the EU:
| Artefact | What's in the bundle |
|---|---|
| Annotator guideline | guideline_v1.4.pdf + guideline_history.json showing v1.0–v1.4 with timestamped diffs. Every record carries a guideline_version_hash. |
| Annotator identity log | annotators.csv listing 6 ophthalmologists, their board certifications at time of project, and a per-record annotator_id + reviewer_id. |
| Cohort-level IRR | irr.json with Cohen's κ broken down by 6 cohorts. Adult κ = 0.86; geriatric κ = 0.79 (flagged for re-review). |
| Provenance log | provenance.json with source hospital, licence, lawful basis (Article 9 GDPR + DPDP Section 4), cross-border transfer (India → EU under SCC v2.0), dedup record. |
| Cleaning code | cleaning/ directory at commit hash a4b2f9c. Plus cleaning_sample.jsonl — 100 records showing before/after. |
| Gebru-pattern datasheet | datasheet.pdf (12 pages, 41 Gebru questions answered) auto-populated from the underlying log at export time. |
This bundle is what a Notified Body opens first when conducting an Annex IV Section 2 review. The labels are downstream; the bundle is the audit.
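A buyer could check a delivered bundle for completeness with a few lines of code. The sketch below assumes the file layout from the worked example above; the `REQUIRED` mapping and the `missing_artefacts` helper are hypothetical. It checks presence only, not the content or validity of each artefact.

```python
import tempfile
from pathlib import Path

# Glob patterns per artefact, taken from the worked-example file names.
REQUIRED = {
    "annotator guideline": ["guideline_v*.pdf", "guideline_history.json"],
    "annotator identity log": ["annotators.csv"],
    "cohort-level IRR": ["irr.json"],
    "provenance log": ["provenance.json"],
    "cleaning code": ["cleaning", "cleaning_sample.jsonl"],
    "datasheet": ["datasheet.pdf"],
}


def missing_artefacts(bundle_dir):
    """Return {artefact: [patterns with no match]} for an evidence bundle."""
    root = Path(bundle_dir)
    missing = {}
    for artefact, patterns in REQUIRED.items():
        absent = [p for p in patterns if not any(root.glob(p))]
        if absent:
            missing[artefact] = absent
    return missing


# Build a throwaway bundle that lacks the IRR file, then check it.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ["guideline_v1.4.pdf", "guideline_history.json",
                 "annotators.csv", "provenance.json",
                 "cleaning_sample.jsonl", "datasheet.pdf"]:
        (root / name).touch()
    (root / "cleaning").mkdir()
    result = missing_artefacts(root)

print(result)  # {'cohort-level IRR': ['irr.json']}
```

An empty result means all six artefacts are present in the export. It says nothing yet about whether each one is complete, which is the reviewer's next step.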
How to test a vendor's evidence-grade claim
- Ask for the certificate. If the vendor claims a certification such as ISO/IEC 42001, get the certificate number and verify it against the certification body's public register.
- Ask for a redacted evidence bundle from a previous client. The vendor can redact customer-identifying fields; the structure of the bundle is the test.
- Ask to see the annotator-identity capture inside the tool. If it's an admin export rather than a per-record field, you're looking at a retrofit.
- Ask which IRR metric, at what level of granularity, with what cohort breakdown. A vendor who answers "Cohen's κ at project level" is one regulatory cycle behind.
- Ask whether the export is one click or a ticket. If you have to file a request to get the evidence bundle, it isn't evidence-grade.
FAQ
Q. What does "evidence-grade annotation" actually mean?
Evidence-grade annotation is the category of training-data production where every artefact a regulator, Notified Body, or compliance auditor will ask for is captured at the moment of labelling and exportable as a per-dataset evidence bundle.
Q. How is evidence-grade annotation different from high-quality annotation?
High-quality annotation describes the label accuracy. Evidence-grade annotation describes the audit-defensibility of the labelling process. Quality and defensibility are independent axes.
Q. Why can't evidence be reconstructed after the data has been shipped?
Three of the six artefacts decay irrecoverably at the moment the annotation tool moves to the next batch. Reconstruction is forensic re-annotation, not documentation.
Q. Which regulations require evidence-grade annotation?
Four make it explicit. EU AI Act Article 11 and Annex IV Section 2. HIPAA audit controls at 45 CFR 164.312(b). SOC 2 Trust Services Criteria CC8. ISO/IEC 42001:2023 Annex A.7. Together they define the working AI compliance baseline for 2026.
Q. How is evidence-grade annotation related to Datasheets for Datasets?
Datasheets for Datasets (Gebru et al., 2018) is the upstream academic framework. Evidence-grade annotation operationalises the Composition, Collection Process, and Preprocessing sections.
Q. Does ISO/IEC 42001 certification mean a vendor produces evidence-grade annotation?
No. Necessary but not sufficient. Ask specifically whether the vendor's tool captures the six artefacts at annotation time.
Q. Can general-purpose annotation tools produce evidence-grade output?
Partially. Versioned guidelines, cohort-level IRR, provenance with cross-border logging, and exportable Gebru-pattern datasheets are usually retrofits.
Q. Is evidence-grade annotation more expensive than standard annotation?
On the headline per-label rate, marginally — typically 10–25% higher. On the risk-adjusted TCO, it's lower. See our Outsourcing pillar for the full model.
Next step
Don't have a per-dataset evidence bundle for your training data?
If you are shipping AI into a regulated market in 2026 — EU AI Act, HIPAA, DPDP, FDA SaMD — and you do not yet have a per-dataset evidence bundle for any of your training data, that is the gap to close before your next audit. Start with a Compliance Review — a one-hour structured walkthrough where we map your risk surface to LabelFort's controls.