An agentic regulatory attestation pipeline

Four questions every compliance team gets asked.

Attestloop is a multi-agent regulatory attestation pipeline. v2.0.0 ships five LLM-driven agents — Classifier, Extractor, Mapper, Critic, Clarifier — connected by a typed state machine with conditional routing and parallel execution. Currently a research artefact demonstrating the approach against the EU AI Act and NIST AI RMF. Total cost per attestation: $2.09. Total wall-clock: 13 minutes. Open source under Apache 2.0.

A research artefact built by Simon Newton. v2.0.0 — see GitHub for source.

Question 01 / 04

Are we covered?

When a regulator publishes something new, leadership wants to know whether it affects us — and whether our existing controls already handle it. The honest answer is usually "we don't know yet, give us two weeks." Attestloop closes the per-publication slice of that gap to about fifteen minutes: read the document, extract the binding obligations, map them against a named control framework, surface the gaps. Multi-source monitoring across regulators is v3 work.

Source: Commission Guidelines on prohibited AI practices · Mapped against: NIST AI Risk Management Framework 1.0

71

binding obligations identified

61

mapped to existing controls

10

framework gaps surfaced

Run completed 2026-05-01 in 13 minutes 26 seconds · $2.09 total cost

Classifier confirms scope; Clarifier handles ambiguous cases; Extractor identifies binding obligations from the source PDF; Mapper compares each to the active control framework with explicit confidence floors; Critic second-passes any low-confidence mappings.
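The conditional routing described above can be sketched in plain Python. This is a minimal, illustrative sketch — function names, thresholds, and the dict-based state are assumptions for clarity; the real pipeline is a compiled LangGraph state machine with typed state.

```python
# Illustrative sketch of the Classifier -> Clarifier/Extractor routing.
# All names and thresholds here are hypothetical, not the production values.

def classify(state: dict) -> dict:
    # Classifier decides whether the publication is in scope.
    state["in_scope"] = state.get("confidence", 1.0) >= 0.8
    return state

def route_after_classifier(state: dict) -> str:
    # Conditional edge: ambiguous classifications go to the Clarifier,
    # clear in-scope documents go straight to the Extractor.
    if not state["in_scope"] and state.get("confidence", 1.0) > 0.4:
        return "clarifier"
    return "extractor" if state["in_scope"] else "end"

state = classify({"doc": "guidelines.pdf", "confidence": 0.95})
assert route_after_classifier(state) == "extractor"
```

In the real graph the same decision is expressed as a conditional edge rather than an if/else, which is what lets the diagram below be generated from the compiled topology.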

Question 02 / 04

What do we need to do, and by when?

Once a regulation is in scope, compliance teams need a list of concrete actions, owners, and deadlines — not "review your AI governance posture" but actual tasks tied to actual control IDs.

ID · Article · Requirement · Scope · Deadline
EUAIA-OBL-001 · Article 5(1)(a) · Providers and deployers of AI systems that deploy manipulative or deceptive techniques shall comply with relevant appli… · Providers and deployers of AI systems deploying manipulative or deceptive techn… · 2025-02-02
EUAIA-OBL-008 · Section 2.4 / Article 3(3) AI Act; recital context of Article 5 AI Act · Providers must ensure their AI systems meet all relevant requirements before placing them on the market or putting them… · Providers of AI systems placed on the EU market or put into service in the Union · before placing on the market or putting into service
EUAIA-OBL-020 · Section 2.9.1 (paragraph 53) · Member States must designate their competent market surveillance authorities by 2 August 2025. · Member States · 2025-08-02
EUAIA-OBL-071 · Article 113 · Providers and deployers of AI systems shall take necessary measures to ensure that they do not place on the market, put… · Providers and deployers of AI systems subject to Article 5 AI Act prohibitions · 2025-02-02

Excerpt from the v6 canonical run. The full report contains 71 obligations with mapped control IDs, proposed actions, and Critic flags on low-confidence mappings. View the full report on GitHub.

Each obligation extracted with source paragraph, regulator-defined scope, deadline where specified, and evidence required. Mapper produces 1–3 control mappings per obligation, each with confidence score and reasoning anchored in specific control text.
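The mapping record and its confidence floor can be sketched as a small data structure. This is a minimal sketch with illustrative field names and an illustrative control ID; the 0.75 floor is taken from the v3 row of the run-comparison table further down the page.

```python
# Sketch of a per-obligation control mapping and the confidence floor
# that routes weak mappings to the Critic. Field names are illustrative.
from dataclasses import dataclass

@dataclass
class ControlMapping:
    obligation_id: str   # e.g. "EUAIA-OBL-020"
    control_id: str      # a control from the active framework
    confidence: float    # model-reported confidence, 0.0-1.0
    reasoning: str       # anchored in specific control text

CONFIDENCE_FLOOR = 0.75

def needs_critic_review(m: ControlMapping) -> bool:
    # Mappings below the floor get a second pass from the Critic.
    return m.confidence < CONFIDENCE_FLOOR

weak = ControlMapping("EUAIA-OBL-020", "GOVERN 1.1", 0.62, "partial textual match")
assert needs_critic_review(weak)
```

The 1–3 mappings per obligation in the report are simply a list of records like this, each carrying its own reasoning string.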

Question 03 / 04

Prove it.

At audit or board review time, compliance teams need evidence: which obligations were assessed, when, by whom, against which version of the regulation, with what conclusion. This is what most existing tools do badly because they weren't designed for AI-pace regulatory change.

Provenance footer · v6 canonical run

- Regulation: EU Artificial Intelligence Act (Regulation 2024/1689) (`eu_ai_act`, EU)
- Framework: NIST AI Risk Management Framework 1.0 (`nist_ai_rmf`, 72 controls)
- Classifier model: `claude-haiku-4-5-20251001`
- Extractor model: `claude-sonnet-4-6`
- Mapper model: `claude-sonnet-4-6`
- Critic model: `claude-sonnet-4-6`
- Classifier prompt SHA-256: `b59962514c4342fc1d6181fb3964dd366c8f6e450218d4e4ff3b02c50038b099`
- Extractor prompt SHA-256: `0828eebb6dd8ad34d769f36773f14888bb048bcdc5ca02e940509fd42701b7ba`
- Mapper prompt SHA-256: `9090c11e1e4b04f07ab617e765a4d0342497ebdccdc2faa88410b8d2424d9cfd`
- Critic prompt SHA-256: `8de784ba4876b414c22f901c530dd2321c591eeac2c9fd36481bc3d0231979c7`
- Critic decisions: 44 reviewed (15 flagged)
- Started at: 2026-05-01T00:08:08.730468+00:00
- Total cost: $2.0926
- Total tokens: 218,476 input / 62,434 output

Every LLM call also logs input, output, model, prompt version, cost, and latency to immutable JSON. Each run produces a hashed provenance footer that survives audit.

  • Per-call audit trail
  • Hashed prompt versions
  • Sourced control library
  • Reproducible against same source
  • Second-pass review by independent Critic agent

Provenance is a first-class output, not an afterthought. The system is designed so that every claim in the report links back through a chain of inputs, prompts, model versions, and timestamps that an auditor can verify.
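The prompt hashes in the footer above are plain SHA-256 digests, which can be reproduced with the standard library alone. A minimal sketch, with illustrative field names — the real footer carries more fields (models, costs, token counts, timestamps):

```python
# Sketch of computing prompt hashes for a provenance footer.
# Field names are illustrative; only hashlib/json from the stdlib are used.
import hashlib
import json

def prompt_sha256(prompt_text: str) -> str:
    # Hash the exact prompt text so any change to it changes the footer.
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()

def provenance_footer(run: dict, prompts: dict) -> dict:
    return {
        **run,
        "prompt_hashes": {name: prompt_sha256(p) for name, p in prompts.items()},
    }

footer = provenance_footer(
    {"regulation": "eu_ai_act", "framework": "nist_ai_rmf"},
    {"classifier": "You are a scope classifier for regulatory documents."},
)
print(json.dumps(footer, indent=2))
```

An auditor who holds the prompt file can recompute the digest and confirm it matches the footer, which is what makes the hash a verification anchor rather than decoration.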

Question 04 / 04

What's coming next?

Boards ask about the regulatory pipeline, not just last week's publication. What's in flight at the regulator that we should be preparing for? Today (v2.0.0) Attestloop answers this on demand — point it at any regulator URL for ad-hoc triage. Scheduled monitoring of regulator sources is the v3 work.

Today (v2.0.0)

On-demand. Point Attestloop at any regulator URL and it tells you whether the publication is in scope and what obligations it contains. Useful for ad-hoc triage but not for forward monitoring.

Architected, shipping in v3

A Watcher agent polls regulator sources on a schedule — EUR-Lex, FCA Handbook, EBA, ICO, ESMA — deduplicates against run history, and surfaces new in-scope publications before anyone has to find them by hand. Per-regulator scrape adapters are in development; the architecture and registry already support multi-source polling.
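The deduplication step a Watcher could run against historical runs is straightforward to sketch. This is an assumption about how it might work — keyed on a URL hash — not a description of the shipped implementation:

```python
# Illustrative Watcher dedup: keep only publications whose URL hash
# has not been seen in a previous run. Keying scheme is hypothetical.
import hashlib

def pub_key(url: str) -> str:
    return hashlib.sha256(url.encode("utf-8")).hexdigest()

def new_publications(fetched: list[str], history: set[str]) -> list[str]:
    fresh = [u for u in fetched if pub_key(u) not in history]
    history.update(pub_key(u) for u in fresh)
    return fresh

history: set[str] = set()
batch1 = new_publications(
    ["https://eur-lex.europa.eu/a", "https://eur-lex.europa.eu/b"], history)
batch2 = new_publications(
    ["https://eur-lex.europa.eu/b", "https://eur-lex.europa.eu/c"], history)
assert batch2 == ["https://eur-lex.europa.eu/c"]
```

In practice the key would likely include more than the URL (publication date, document version), since regulators sometimes republish at the same address.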

v3 backlog

Forward-looking signals from regulator pipelines: published consultations, draft technical standards, work programmes. Distinguishes between 'binding next month' and 'consultation closing in March' so compliance teams can prioritise. Gives boards real horizon answers, not just last-week summaries.

Today: ad-hoc triage. v3: scheduled monitoring. v3+: regulator pipeline awareness. Being honest about which question Attestloop can answer at each stage matters as much as the answers themselves.

The Watcher agent is the v3 work that converts Attestloop from on-demand triage to scheduled regulatory monitoring. Multi-source polling, per-regulator adapters, deduplication against historical runs, and alerting integration are all tracked in the GitHub backlog.


Pipeline

How the pipeline works

The diagram below is generated directly from the compiled LangGraph state machine — scripts/render_graph.py writes the Mermaid source to docs/orchestration/v6_pipeline.mmd, which the page renders client-side. If the orchestration changes, the diagram regenerates automatically. v2.0.0's five-agent topology with conditional routing is what's drawn here. Click any agent below for role, prompt, sample input and output, and per-call metrics from the canonical v6 run.

Compiled LangGraph state machine · v2.0.0

LLM agents · click to expand
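As a rough illustration of what scripts/render_graph.py does, Mermaid source can be emitted from an edge list with nothing but the standard library. The edge list below is an assumption drawn from the agent descriptions on this page — the real script derives the edges from the compiled LangGraph graph, so the diagram cannot drift from the code:

```python
# Illustrative Mermaid emission from an edge list. The edges shown here
# are inferred from the page's agent descriptions, not from the real graph.
EDGES = [
    ("classifier", "clarifier"),
    ("classifier", "extractor"),
    ("clarifier", "extractor"),
    ("extractor", "mapper"),
    ("mapper", "critic"),
]

def to_mermaid(edges: list[tuple[str, str]]) -> str:
    lines = ["graph TD"] + [f"    {a} --> {b}" for a, b in edges]
    return "\n".join(lines)

mermaid = to_mermaid(EDGES)
print(mermaid)
```

Writing the result to docs/orchestration/v6_pipeline.mmd and rendering it client-side is what keeps the published diagram in lockstep with the orchestration code.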

Output

What it produces

Seven runs against the same source document. v1 through v5 iterate on the original architecture; v5_eq and v6 isolate the orchestration impact of the v2.0.0 rebuild.

Version Approach Obligations Mappings Unmapped Cost (USD) Runtime
v1 Truncated extractor (50 K char cap), mapper unconstrained 18 54 0 $0.62 5m 17s
v2 Chunked extractor (12 chunks), mapper unconstrained 68 203 0 $2.61 21m 22s
v3 Mapper confidence floor 0.75, no slot-filling 72 164 12 $2.78 41m 35s
v4 Anthropic prompt caching on mapper controls list 69 124 24 $1.19 14m 51s
v5 Fuzzy dedup, title fallback, null rendering, mapper nudge 71 154 13 $1.31 17m 17s
v5_eq v6 code, V5_EQUIVALENT config (serial Mapper, no Critic, no Clarifier) 72 157 13 $1.31 12m 38s
v6 LangGraph + Critic + Clarifier + 8-way concurrent Mapper 71 160 10 $2.09 13m 26s

Six iterations from a 50,000-character truncated baseline to a LangGraph state machine with second-pass review and parallel execution, each step changing one variable while holding output quality steady. The v6 canonical run produces the same Commission Guidelines attestation in 13 minutes 26 seconds at $2.09 — Mapper wall-clock is 8.13× faster than v5_eq, while total wall-clock stays essentially flat because the Critic adds roughly three minutes of sequential review work.

v3 → v4 caching delivered a 30× return on the cache write cost. v4 → v5 dedup removed 12 paraphrased duplicates that substring-match missed. v5 → v6 swapped sequential function calls for a typed state machine and 8-way concurrent Mapper. The v5_eq vs v6 pair isolates the orchestration impact under identical code. Read the writeup for the engineering detail.
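The 8-way concurrent Mapper fan-out can be sketched with stdlib primitives. A minimal sketch under assumptions — `map_obligation` is a stand-in for one Mapper LLM call, and the real pipeline parallelises through LangGraph's fan-out rather than a thread pool:

```python
# Sketch of an 8-way concurrent Mapper fan-out over 71 obligations.
# map_obligation is a hypothetical stand-in for one Mapper LLM call.
from concurrent.futures import ThreadPoolExecutor

def map_obligation(obligation_id: str) -> dict:
    # In the real pipeline this is an LLM call that returns 1-3
    # control mappings with confidence scores.
    return {"obligation": obligation_id, "mappings": []}

obligations = [f"EUAIA-OBL-{i:03d}" for i in range(1, 72)]

with ThreadPoolExecutor(max_workers=8) as pool:
    # pool.map preserves input order, so results line up with obligations.
    results = list(pool.map(map_obligation, obligations))

assert len(results) == 71
```

Because each Mapper call is I/O-bound (waiting on the API), eight concurrent workers cut Mapper wall-clock dramatically without increasing token cost.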

Read the writeup →

Reproducibility

Run it yourself

Live re-run capability is tracked for v3. The cached v6 run above shows the full output — every prompt, every LLM response, every cost line, every obligation, every mapping. The Python source is on GitHub; the pipeline runs end-to-end on a single machine with an Anthropic API key.
