Evaluation & Tuning: NCP-AAI Domain 3 (13%)
Evaluation & Tuning is 13% of the NCP-AAI exam — and the domain that separates a demo from a production agent. Here's how agents are measured and improved.
What this domain covers
Evaluation and Tuning is about knowing whether your agent actually works and making it work better. Agentic systems are harder to evaluate than single prompts because they take multiple steps, call tools, and can reach a right answer by a wrong path (or vice versa). This domain is 13% of the exam.
Outcome vs trajectory evaluation
There are two complementary lenses. Outcome (or end-to-end) evaluation asks "was the final answer correct?" Trajectory evaluation asks "did it take a sensible path: the right tools, in a reasonable order, without wasted steps?" A correct answer reached by luck and a sound path that still ended in failure are both worth recording, so production systems track both.
Outcome eval:     final_answer == expected?              (did it succeed)
Trajectory eval:  right tools, sane order, no loops?     (did it reason well)
Cost/latency:     tokens, tool calls, wall-clock time    (was it efficient)
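The three lenses can be sketched in a few lines of Python. This is a minimal illustration, not a standard API: the `Run` record, the function names, and the "same tool twice in a row" loop heuristic are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """One recorded agent run (hypothetical record format)."""
    final_answer: str
    tool_calls: list[str] = field(default_factory=list)
    tokens: int = 0
    seconds: float = 0.0


def outcome_ok(run: Run, expected: str) -> bool:
    # Outcome eval: does the final answer match the expected one?
    return run.final_answer.strip().lower() == expected.strip().lower()


def trajectory_report(run: Run, expected_tools: list[str], max_steps: int) -> dict:
    # Trajectory eval: right tools, sane order, no loops.
    calls = run.tool_calls
    # The expected tools must appear in order (other calls may sit between them).
    remaining = iter(calls)
    in_order = all(tool in remaining for tool in expected_tools)
    # Crude loop signal (an assumption): the same tool called twice in a row.
    repeated = any(a == b for a, b in zip(calls, calls[1:]))
    return {
        "used_expected_tools_in_order": in_order,
        "no_immediate_repeats": not repeated,
        "within_step_budget": len(calls) <= max_steps,
    }


def evaluate(run: Run, expected: str, expected_tools: list[str], max_steps: int = 6) -> dict:
    return {
        "outcome": outcome_ok(run, expected),
        "trajectory": trajectory_report(run, expected_tools, max_steps),
        "cost": {
            "tokens": run.tokens,
            "tool_calls": len(run.tool_calls),
            "seconds": run.seconds,
        },
    }


# A right answer reached by a poor path: outcome passes, trajectory does not.
lucky = Run(final_answer="42", tool_calls=["search", "search", "search"],
            tokens=900, seconds=4.2)
report = evaluate(lucky, expected="42", expected_tools=["search", "calculator"])
print(report)
```

Scoring only the outcome would mark this run as a success. The trajectory report shows it never called the calculator and repeated the same search, which is the kind of signal a production dashboard would want alongside the pass/fail result.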
LLM-as-judge
For open-ended outputs with no single correct answer, a common technique is LLM-as-judge: a separate model scores the output against a rubric. It scales far better than human grading, but it is imperfect — judges have biases (e.g. favoring longer answers) and must themselves be validated against human labels before you trust them.
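Validating a judge against human labels can be as simple as measuring how often the two agree on the same outputs. The sketch below assumes you already have pass/fail verdicts from both sources for the same items. The label lists are made-up illustration data, and using Cohen's kappa (agreement corrected for chance) is one reasonable choice rather than a prescribed method.

```python
from collections import Counter


def agreement(human: list[str], judge: list[str]) -> float:
    # Fraction of items where the judge gave the same label as the human.
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    # Agreement corrected for what two raters would reach by chance alone.
    n = len(human)
    observed = agreement(human, judge)
    h_counts, j_counts = Counter(human), Counter(judge)
    expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label for everything.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels for the same four outputs.
human_labels = ["pass", "pass", "fail", "fail"]
judge_labels = ["pass", "pass", "fail", "pass"]

print(agreement(human_labels, judge_labels))     # 0.75
print(cohens_kappa(human_labels, judge_labels))  # 0.5
```

Here the raw agreement looks decent at 0.75, but kappa is only 0.5 because a judge that leans toward "pass" agrees with humans fairly often by chance. What threshold counts as trustworthy is a judgment call for your use case.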
Building an eval set
You cannot improve what you do not measure. Build a representative dataset of realistic tasks with known-good outcomes, run the agent against it on every change, and track metrics over time. Include hard and edge cases, not just the happy path — that is where regressions hide.
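A bare-bones eval harness might look like the following. Everything here is illustrative: the case format (`id`, `task`, `expected`, `tags`), the `run_eval` function, and the `toy_agent` stand-in are assumptions for the sketch, and a real agent would replace the toy callable.

```python
from typing import Callable

# Hypothetical eval set: a happy-path case plus a hard case and an edge case.
EVAL_SET = [
    {"id": "add-basic", "task": "2 + 3", "expected": "5", "tags": ["happy"]},
    {"id": "add-negative", "task": "10 + -4", "expected": "6", "tags": ["hard"]},
    {"id": "empty-task", "task": "", "expected": "error: empty task", "tags": ["edge"]},
]


def toy_agent(task: str) -> str:
    # Stand-in for a real agent: adds two integers written as "a + b".
    left, right = task.split("+")
    return str(int(left) + int(right))


def run_eval(agent: Callable[[str], str], cases: list[dict]) -> dict:
    if not cases:
        raise ValueError("eval set is empty")
    results: dict[str, bool] = {}
    for case in cases:
        try:
            passed = agent(case["task"]).strip() == case["expected"]
        except Exception:
            # A crash counts as a failed case, not a crashed eval run.
            passed = False
        results[case["id"]] = passed

    # Break the pass rate down by tag so edge-case regressions stay visible.
    tag_results: dict[str, list[bool]] = {}
    for case in cases:
        for tag in case.get("tags", []):
            tag_results.setdefault(tag, []).append(results[case["id"]])

    return {
        "pass_rate": sum(results.values()) / len(results),
        "by_tag": {tag: sum(v) / len(v) for tag, v in tag_results.items()},
        "results": results,
    }


report = run_eval(toy_agent, EVAL_SET)
print(report["pass_rate"], report["by_tag"])
```

The toy agent handles both arithmetic cases but crashes on the empty task, so the aggregate pass rate hides a failure that the per-tag breakdown exposes. Storing the per-case `results` from each run is what makes tracking over time possible.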
Tuning safely
Improving an agent means changing prompts, tools, models, or parameters — and any change can help one case while silently breaking another. The discipline is to change one thing at a time and re-run the eval set, so you keep only changes that improve aggregate metrics without regressions. Without an eval harness, "tuning" is just guessing.
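The "keep only changes that improve aggregate metrics without regressions" rule can be written down as a small gate. This sketch assumes each eval run produces a mapping from case id to pass/fail. The `compare` and `should_keep` names and the sample results are hypothetical.

```python
def compare(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict:
    # Only compare cases that were run in both versions.
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("no eval cases in common")
    # Regressions: passed before the change, fail after it.
    regressions = [c for c in shared if baseline[c] and not candidate[c]]
    # Fixes: failed before the change, pass after it.
    fixes = [c for c in shared if not baseline[c] and candidate[c]]
    delta = (sum(candidate[c] for c in shared) - sum(baseline[c] for c in shared)) / len(shared)
    return {"regressions": regressions, "fixes": fixes, "pass_rate_delta": delta}


def should_keep(report: dict) -> bool:
    # Keep a change only if the aggregate improved and nothing broke.
    return report["pass_rate_delta"] > 0 and not report["regressions"]


# Hypothetical per-case results before and after a single prompt change.
before = {"a": True, "b": False, "c": True}
after_tweak = {"a": True, "b": True, "c": False}

report = compare(before, after_tweak)
print(report)               # one fix ("b"), one regression ("c"), delta 0.0
print(should_keep(report))  # False
```

The tweak fixed one case and broke another, so the pass rate is unchanged and the change is rejected. Because only one thing changed between the two runs, the regression on case "c" can be attributed to that change directly.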
Exam tip
If an exam scenario describes an agent that "seems better" after a prompt tweak but offers no measurement, the correct action is to evaluate against a fixed eval set — not to ship the change. Evaluation before tuning is the recurring theme of this domain.