How much of the NCP-AAI exam is Evaluation and Tuning?

Evaluation and Tuning is 13% of the NCP-AAI exam.

What is the difference between outcome and trajectory evaluation?

Outcome evaluation asks whether the final answer was correct. Trajectory evaluation asks whether the agent took a sensible path — the right tools, in a reasonable order, without wasted steps. Production systems track both, plus cost and latency.

What is LLM-as-judge and when should you use it?

LLM-as-judge uses a separate model to score open-ended outputs against a rubric where there is no single correct answer. It scales better than human grading but has biases and must be calibrated against human-labeled examples before you trust it.

Why do you need an eval set before tuning an agent?

Because any change to prompts, tools, or models can help one case while silently breaking another. Running a fixed, representative eval set on every change lets you keep only the changes that improve aggregate metrics without regressions.

Evaluation & Tuning: NCP-AAI Domain 3 (13%)

Evaluation & Tuning is 13% of the NCP-AAI exam — and the domain that separates a demo from a production agent. Here's how agents are measured and improved.

Examifyr·2026·7 min read

What this domain covers

Evaluation and Tuning is about knowing whether your agent actually works and making it work better. Agentic systems are harder to evaluate than single prompts because they take multiple steps, call tools, and can reach a right answer by a wrong path (or vice versa). This domain is 13% of the exam.

Outcome vs trajectory evaluation

There are two complementary lenses. Outcome (or end-to-end) evaluation asks "was the final answer correct?" Trajectory evaluation asks "did it take a sensible path: the right tools, in a reasonable order, without wasted steps?" A correct answer reached by luck hides a reasoning flaw that will surface on other inputs, and a sound path that still ends in a wrong answer points to a different fix than a broken path does, so production systems track both.

Outcome eval:     final_answer == expected?            (did it succeed)
Trajectory eval:  right tools, sane order, no loops?    (did it reason well)
Cost/latency:     tokens, tool calls, wall-clock time   (was it efficient)
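The three checks above can be sketched in a few lines of Python. This is a minimal illustration, not a standard API: the `AgentRun` record, its field names, and the specific trajectory rules (expected tools appear in order, a step budget, no back-to-back repeated call) are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class AgentRun:
    """One recorded agent run. Field names are hypothetical."""
    final_answer: str
    tool_calls: list[str] = field(default_factory=list)
    total_tokens: int = 0
    wall_clock_s: float = 0.0


def outcome_ok(run: AgentRun, expected: str) -> bool:
    # Outcome eval: compare only the final answer, ignoring the path taken.
    return run.final_answer.strip().lower() == expected.strip().lower()


def trajectory_report(run: AgentRun, expected_tools: list[str], max_steps: int) -> dict:
    # Trajectory eval: expected tools must appear in order (as a subsequence),
    # the run must stay within a step budget, and no call may repeat back to back.
    remaining = iter(run.tool_calls)
    in_order = all(tool in remaining for tool in expected_tools)
    repeated = any(a == b for a, b in zip(run.tool_calls, run.tool_calls[1:]))
    return {
        "tools_in_order": in_order,
        "within_budget": len(run.tool_calls) <= max_steps,
        "no_immediate_repeat": not repeated,
    }


# A run that gets the right answer but wastes a step.
run = AgentRun("Paris", ["search", "search", "lookup"], total_tokens=900, wall_clock_s=2.4)
print(outcome_ok(run, "paris"))                                   # True
print(trajectory_report(run, ["search", "lookup"], max_steps=5))  # no_immediate_repeat is False
```

The example run passes outcome evaluation and fails one trajectory check, which is the case an outcome-only metric would miss.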

LLM-as-judge

For open-ended outputs with no single correct answer, a common technique is LLM-as-judge: a separate model scores the output against a rubric. It scales far better than human grading, but it is imperfect — judges have biases (e.g. favoring longer answers) and must themselves be validated against human labels before you trust them.

Note: LLM-as-judge is a measurement tool, not ground truth. Always calibrate the judge against a human-labeled sample, and watch for known biases like length and position effects.
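One way to do that calibration, sketched under assumptions: collect the judge's verdicts and human verdicts for the same sample, then compute raw agreement and Cohen's kappa (agreement corrected for chance). The label values and the sample below are made up for illustration, and what counts as "good enough" agreement is a judgment call the guide does not specify.

```python
from collections import Counter


def agreement(judge: list[str], human: list[str]) -> float:
    # Fraction of examples where the judge and the human label match.
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: list[str], human: list[str]) -> float:
    # Chance-corrected agreement: (p_o - p_e) / (1 - p_e).
    p_o = agreement(judge, human)
    n = len(judge)
    judge_counts, human_counts = Counter(judge), Counter(human)
    labels = set(judge_counts) | set(human_counts)
    p_e = sum(judge_counts[label] * human_counts[label] for label in labels) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Illustrative labels only: the judge says "pass" more often than the humans do.
human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "fail"]
judge = ["pass", "pass", "pass", "fail", "pass", "fail", "pass", "pass"]
print(agreement(judge, human))     # 0.75
print(cohens_kappa(judge, human))  # 0.5
```

Raw agreement alone can flatter a judge that simply favors the majority label, which is why the chance-corrected figure is reported alongside it.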

Building an eval set

You cannot improve what you do not measure. Build a representative dataset of realistic tasks with known-good outcomes, run the agent against it on every change, and track metrics over time. Include hard and edge cases, not just the happy path — that is where regressions hide.
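A minimal eval harness along these lines might look like the sketch below. Everything named here is hypothetical: `EvalCase`, the `tag` field, exact-match scoring, and the toy agent standing in for a real one. Real agents usually need a richer scoring function than string equality.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    task: str
    expected: str
    tag: str = "happy_path"  # for example "happy_path", "edge", "hard"


def run_eval(agent: Callable[[str], str], cases: list[EvalCase]) -> dict[str, bool]:
    # Run every case; an exception counts as a failure instead of aborting the run.
    results: dict[str, bool] = {}
    for case in cases:
        try:
            results[case.case_id] = agent(case.task).strip() == case.expected
        except Exception:
            results[case.case_id] = False
    return results


def pass_rate_by_tag(cases: list[EvalCase], results: dict[str, bool]) -> dict[str, float]:
    # Break the pass rate down by tag so edge-case failures are not averaged away.
    totals: dict[str, list[int]] = {}
    for case in cases:
        passed, seen = totals.setdefault(case.tag, [0, 0])
        totals[case.tag] = [passed + results[case.case_id], seen + 1]
    return {tag: passed / seen for tag, (passed, seen) in totals.items()}


def toy_agent(task: str) -> str:
    # Stand-in for a real agent: handles a single happy-path input.
    if task == "2+2":
        return "4"
    raise ValueError("unsupported task")


cases = [EvalCase("c1", "2+2", "4"), EvalCase("c2", "2+", "error", tag="edge")]
results = run_eval(toy_agent, cases)
print(pass_rate_by_tag(cases, results))  # {'happy_path': 1.0, 'edge': 0.0}
```

Reporting by tag keeps a strong happy-path score from masking failures on the hard and edge cases.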

Tuning safely

Improving an agent means changing prompts, tools, models, or parameters — and any change can help one case while silently breaking another. The discipline is to change one thing at a time and re-run the eval set, so you keep only changes that improve aggregate metrics without regressions. Without an eval harness, "tuning" is just guessing.
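The keep-or-revert decision can be made mechanical. The sketch below assumes each eval run is stored as a mapping from case id to pass/fail over the same fixed eval set; the function names and the strict "no regressions at all" rule are illustrative choices, and a team might reasonably loosen or tighten that rule.

```python
def compare_runs(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict:
    # Both dicts map case id -> passed, produced from the same fixed eval set.
    if not baseline or baseline.keys() != candidate.keys():
        raise ValueError("runs must cover the same non-empty set of eval cases")
    regressions = sorted(c for c in baseline if baseline[c] and not candidate[c])
    fixes = sorted(c for c in baseline if not baseline[c] and candidate[c])
    n = len(baseline)
    return {
        "baseline_pass_rate": sum(baseline.values()) / n,
        "candidate_pass_rate": sum(candidate.values()) / n,
        "regressions": regressions,
        "fixes": fixes,
    }


def should_keep(report: dict) -> bool:
    # Keep a change only if the aggregate improves and nothing that passed now fails.
    improved = report["candidate_pass_rate"] > report["baseline_pass_rate"]
    return improved and not report["regressions"]


# A prompt tweak that fixes two cases but silently breaks one.
baseline = {"a": True, "b": False, "c": True, "d": False}
candidate = {"a": True, "b": True, "c": False, "d": True}
report = compare_runs(baseline, candidate)
print(report["regressions"])  # ['c']
print(should_keep(report))    # False
```

Here the aggregate pass rate rises from 0.5 to 0.75, yet the change is rejected because case `c` regressed, which is exactly the failure an aggregate-only comparison hides.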

Exam tip

If an exam scenario describes an agent that "seems better" after a prompt tweak but offers no measurement, the correct action is to evaluate against a fixed eval set — not to ship the change. Evaluation before tuning is the recurring theme of this domain.
