AI LABS · EVALUATION STUDIO · TEST IT BEFORE IT SHIPS

Test it before it ships.

For the teams that stopped trusting the eval script. Codify the rubric, score every release against it, and ship the result with proof attached.

RUBRIC
Codified once

Your judges. Your weights. Versioned.

RUN
Every release

Same rubric. Different model. Same standard.

PROOF
Attached

Scorecards and reviewer notes follow the release.

HOW IT WORKS

Three steps. No eval script.

Codify the rubric. Run every release against it. Ship with the proof.

STEP 01
WHAT WE CODIFY

Define the rubric

Encode the judgment your team already trusts. Versioned, reviewable, and the same on every run.

STEP 02
WHAT WE SCORE

Run every release against it

Each release passes through the same rubric. Drift, regression, and bias show up before the gate.

STEP 03
WHAT WE SIGN

Ship with proof attached

Scorecards, reviewer notes, and decisions stay with the release. Sealed when it ships.

RUBRIC ANATOMY · WHAT WE CODIFY

A rubric is governed before it scores.

Rubric Studio Cloud turns prompt context, model outputs, and expert judgment into approved criteria, criterion-level grading, evidence capture, scorecards, and regression memory. Four named steps. One evaluation record.

01 · AI DRAFT

Draft criteria from the prompt context

The model reads the prompt, retrieved context, and prior failures. It proposes criteria and warnings. Nothing activates yet.

REVIEW REQUIRED
02 · EXPERT APPROVAL

No AI-drafted rubric ships without a human signoff

A named approver reads the draft, edits the criteria, sets the weights, and signs the version. The history is permanent.

BLOCKED UNTIL SIGNED
03 · WORKER GRADING

Criterion-level grades with required evidence

Reviewers grade each criterion with the rubric and the evidence visible. Blocker reasons are first-class fields, not free-text notes.

CRITERION LEVEL
04 · SCORECARD MEMORY

Submitted grades roll into release reads

Each grade contributes to failure breakdowns, regression bank entries, and the model scorecard the release team carries to approval.

STORED · INDEXED
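
To make the four steps concrete, here is a minimal Python sketch of the record they produce. Every name in it (Criterion, RubricVersion, approve) is a hypothetical illustration, not the studio's actual schema.

from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass(frozen=True)
class Criterion:
    name: str              # e.g. "disclosure_copy" (hypothetical)
    weight: float          # relative importance in the scorecard
    pass_threshold: float  # minimum score for this criterion to pass

@dataclass(frozen=True)
class RubricVersion:
    version: int
    criteria: tuple                    # step 01: AI-drafted Criterion entries
    approved_by: Optional[str] = None  # step 02: the named human approver
    approved_at: Optional[datetime] = None

    def approve(self, approver: str) -> "RubricVersion":
        # Step 02: the draft only activates once a named human signs it,
        # and signing produces a new immutable version record.
        return RubricVersion(self.version, self.criteria,
                             approver, datetime.now(timezone.utc))

    @property
    def active(self) -> bool:
        # Step 03 (worker grading) stays blocked until the version is signed.
        return self.approved_by is not None
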
METHODOLOGY · HOW WE SCORE

Five practices that make a rubric survive a release.

A rubric that holds up under release pressure is not a Likert scale. It is a weighted contract with ground-truth anchors, a calibration loop, and a memory of the cases that have already escaped.

WEIGHTED CRITERIA

The rubric carries weights, not just labels

Each criterion has a weight, a passing threshold, and a fail-state contract. A release that wins on grounding but loses on disclosure is not an averaged-out pass — it is a hold.
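
A minimal sketch of that gating rule in Python. The names and both numeric bars are assumptions for illustration; the point is that the fail-state check runs before the weighted average can rescue a release.

def read_release(scores, rubric):
    """scores: criterion name -> score in [0, 1].
    rubric: list of (name, weight, pass_threshold) tuples."""
    # Fail-state contract: any criterion under its own threshold is a hold,
    # no matter how strong the weighted total is.
    if any(scores[name] < threshold for name, _, threshold in rubric):
        return "HOLD"
    total_weight = sum(weight for _, weight, _ in rubric)
    weighted = sum(scores[name] * weight for name, weight, _ in rubric) / total_weight
    return "PASS" if weighted >= 0.80 else "HOLD"  # 0.80 bar is an assumed default

# Strong grounding cannot average out a disclosure miss:
rubric = [("grounding", 0.6, 0.70), ("disclosure", 0.4, 0.90)]
print(read_release({"grounding": 0.95, "disclosure": 0.60}, rubric))  # HOLD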

GROUND-TRUTH ANCHORS

Every run carries reference cases

We anchor each scoring run with a small set of ground-truth cases the team has agreed on. Drift in the score against the anchors is itself a signal.
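
One way to read that signal, sketched in Python with hypothetical names and an assumed tolerance a team would tune per rubric:

def drifted_anchors(run_scores, agreed_scores, tolerance=0.05):
    """run_scores and agreed_scores map anchor case id -> score."""
    return [case for case, agreed in agreed_scores.items()
            if abs(run_scores[case] - agreed) > tolerance]

# Any case returned here is a signal about the scoring itself, not about
# the candidate model: the anchor cases have not changed.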

BASELINE vs CANDIDATE

Two columns, same rubric

Each run produces a side-by-side: the current production model and the candidate, scored on the same cases against the same criteria. The deltas are reviewable.
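
A sketch of that side-by-side read, again with hypothetical names. Because both columns use the same cases and the same criteria, a plain per-criterion difference is a fair delta.

def score_deltas(baseline, candidate):
    """baseline and candidate: criterion name -> score, same rubric version."""
    return {name: round(candidate[name] - baseline[name], 3) for name in baseline}

deltas = score_deltas({"grounding": 0.88, "disclosure": 0.92},
                      {"grounding": 0.93, "disclosure": 0.84})
# {'grounding': 0.05, 'disclosure': -0.08}: the disclosure regression is the
# reviewable line item, whatever the overall totals say.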

JUDGE CALIBRATION

AI judges with human override

AI judges run first for scale, with calibration scores held against a human reviewer panel. Disagreement above threshold routes to expert review with the case attached.

REGRESSION MEMORY

Every escaped failure becomes a permanent check

The cases the rubric missed get added to the next release's required set. Misses do not get to escape twice.
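
A minimal sketch of that memory, with hypothetical names. The contract is one-way: a case enters the bank once and is required on every run after.

class RegressionBank:
    def __init__(self):
        self._cases = {}

    def add_escape(self, case_id, case):
        # An escaped failure is banked once; nothing here removes it.
        self._cases.setdefault(case_id, case)

    def required_cases(self):
        # Every banked case joins the next release's required set.
        return list(self._cases.values())

bank = RegressionBank()
bank.add_escape("refund-window-disclosure", {"prompt": "...", "expected": "..."})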

WHAT COMES OUT

What your team leaves with.

Every run leaves something the team can act on — and something the next release has to clear.

01

Scorecards

One read on what passed, what failed, and what needs another look.

↳ ARTIFACT
02

Review queues

Cases the rubric flags get routed to the right reviewer with the rubric reading attached.

↳ ARTIFACT
03

Replay suites

Every escaped case becomes a repeatable check the next release must pass.

↳ ARTIFACT
04

Decision timelines

What changed. Who approved it. What the rubric said at the time.

↳ ARTIFACT
05

Evidence packets

Rubric, reviewer notes, and verdict — ready when someone asks.

↳ ARTIFACT
TRACE ANATOMY · WHAT THE SCORE IS TIED TO

The score sits on the reasoning path.

Every score in the studio is tied to the exact prompt, retrieval, tool call, answer, judge reading, and human override that produced it. The reviewer never has to ask “what did the model see?” — it is attached.

01 · PROMPT

The customer or test case the run is grounded on.

02 · RETRIEVAL

Retrieved context, policy lookups, and prior case memory.

03 · TOOL CALL

External calls — account, policy, knowledge base, action.

04 · ANSWER

The candidate's reply, attached to the path that produced it.

05 · JUDGE SCORE

AI judge scoring each criterion with calibration to humans.

06 · HUMAN OVERRIDE

Reviewer accepts, edits, or holds the verdict with reason.
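
A sketch of the record those six parts could form, with hypothetical field names; the point is that a score never travels without the path that produced it.

from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class TraceRecord:
    prompt: str                    # 01: the case the run is grounded on
    retrieval: list                # 02: retrieved context and policy lookups
    tool_calls: list               # 03: external calls made during the run
    answer: str                    # 04: the candidate's reply
    judge_scores: dict             # 05: per-criterion AI judge readings
    human_override: Optional[str]  # 06: reviewer verdict, if any

# A scorecard entry keeps a reference to its TraceRecord, so "what did
# the model see?" is answered by the record, not by memory.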

EXAMPLE RUN · SUPPORT ASSISTANT V42

Customer asks for a refund exception after the policy window. The candidate offered partial credit but missed the disclosure copy the policy requires. AI judges scored 82/100 with escalation clarity below threshold. The reviewer holds the release until the disclosure copy is fixed. Forty-one cases route to the regression bank in the same step.

READING · CANDIDATE vs BASELINE
CALIBRATED JUDGES · AI + HUMAN

Two AI judges. One human reviewer. One signed verdict.

The judges run in parallel for scale. Each one is calibrated against a human panel on a rolling sample. When the judges disagree above threshold — or when either one falls outside its calibration band — the case routes to a reviewer with the rubric, the case, and the AI readings attached.

JUDGE A
Policy fidelity

Reads the candidate's answer against the policy and grades grounding, citation accuracy, and contradiction risk.

JUDGE B
Customer impact

Reads the candidate's answer against the customer's stated need and grades resolution, tone, and downstream effect.

REVIEWER
Final verdict

A human reviewer sees both AI judge readings and the case. They accept, edit, or hold. Their verdict signs the release packet.
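
A sketch of that routing rule in Python, with assumed default thresholds. The calibration inputs are each judge's rolling agreement with the human panel, not a property of the single case.

def route_case(score_a, score_b, calibration_a, calibration_b,
               disagreement_threshold=0.15, calibration_floor=0.80):
    """Both thresholds are assumed defaults, not the studio's real values."""
    judges_disagree = abs(score_a - score_b) > disagreement_threshold
    out_of_band = min(calibration_a, calibration_b) < calibration_floor
    if judges_disagree or out_of_band:
        # Routed with the rubric, the case, and both AI readings attached.
        return "expert_review"
    return "accept"  # accepted cases still feed the rolling calibration sample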

WHERE IT FITS

In the loop, this is where you test.

Test the run. Review the hard cases. Recruit the right specialist. Remember the misses. Approve what's right.

01
Test
● YOU ARE HERE
02
Review
03
Recruit
04
Remember
05
Approve
ON THE RECORD · A FRONTIER AI LAB

“We replaced four eval scripts and a Slack thread with one rubric and a scorecard. The release meeting takes twenty minutes now. The hold-or-ship decision is already on the page.”

Release lead · a frontier AI lab
FAQ · WHAT TEAMS ASK

Four common questions, direct answers.

Q · 01

Does this replace our existing eval suite?

Often, yes — but we can also wrap it. Most teams keep their existing test sets and bring them under one rubric. The deltas are read across all of them.

Q · 02

How do AI judges stay honest?

Each judge has a rolling calibration sample scored by a human panel. If a judge drifts outside its calibration band, runs that depend on it are flagged and the band is re-fit.

Q · 03

What if the rubric itself is wrong?

Every rubric is versioned and signed. When a release surfaces a missing criterion, the rubric gets a new version, the criterion is added, and the prior decisions stay tied to the version that produced them.

Q · 04

Can we run this against custom models?

Yes. The studio is model-agnostic. We have run it against frontier APIs, on-prem checkpoints, and customer-fine-tuned models in the same release gate.

RELATED MODULES

Next to this in the Evaluation OS.

AURAQC

Quality that doesn't end at ship day.

Every issue. Every reviewer. One screen.

See the page →
REGRESSION BANK

Every mistake. Only once.

Every escaped failure becomes a gate the next release cannot cross.

See the page →
COMPLIANCE MONITORING

Compliance that writes itself.

The record builds as the work is done.

See the page →
EVALUATION STUDIO

Test it before it ships.

Bring the rubric your team already trusts. We'll make it the bar every release has to clear.
