Published: Last updated:

AI Evaluation and Guardrails

AI evaluation and guardrails: from a working demo to trustworthy production

An AI system cannot responsibly ship without measurement: evaluation checks systematically whether the quality holds, and guardrails catch at runtime whatever still goes wrong. Together they are the bridge from a working demo to trustworthy production.

A language model that convinces in the demo is not yet a reliable service. The same input may return a different answer tomorrow, contain an invented fact, or break out of its role through a cleverly worded instruction. Anyone embedding AI into a business process therefore needs two mechanisms that complement each other. Evaluation answers, before release, how good the system really is. Guardrails ensure at runtime that a harmful input or a harmful output never gets through in the first place. This page describes why systematic measurement is indispensable for non-deterministic systems, which methods exist for it, how guardrails catch the typical risks, and why both are a permanent part of running AI.

Why evaluation, not inspection

Classic software is deterministic: a test case always yields the same result, and a fixed target value suffices as a check. For a language model this does not hold. The answer varies, it is rarely exactly right or wrong, and three successful examples from development say little about behaviour across a thousand real cases. That is precisely why evaluation replaces gut feeling with a measurement across many cases. It makes two things possible in the first place: deciding whether a new version is better than the old one, and proving that a given quality is reached. Without that measurement, every statement about an AI's quality remains a claim, and without proof there can be no AI governance that ties a model approval to demonstrable quality.

Methods of evaluation

No single method covers everything; in practice they are combined.

  • Test sets with reference answers. A curated collection of inputs with desired outputs against which every new version runs. It fits where there is a clearly correct answer, for example in extraction or classification, and forms the reproducible baseline.
  • LLM-as-judge. A second model rates the first model's answer against a clear criterion, such as correctness, relevance or tone. This scales where there is no single correct answer, but it needs calibration itself so the judge does not introduce its own biases.
  • Human feedback. Domain experts rate samples or flag errors from production. It is the most expensive but most reliable yardstick, and it supplies fresh material for the test sets at the same time.

The interplay produces a robust picture: test sets catch regressions automatically, LLM-as-judge covers the open cases at breadth, and human feedback calibrates both and catches what no automation sees.

Guardrails, the rails at runtime

Evaluation checks before release; guardrails act during operation. They are filters and checks around the model that reject a harmful input or catch a questionable output before it reaches the user. Three risks stand out:

  • Hallucination. The model invents a plausible-sounding but false fact. A guardrail checks the output against the retrieved source and blocks or flags answers that cannot be substantiated.
  • Data leak. An answer contains personal data or internal material that must not leave the building. An output filter detects and redacts such content before it leaves the system.
  • Prompt injection and jailbreak. A manipulated input tries to break the model out of its role or to produce forbidden content. This attack class is also a security topic and belongs in the Security Strategy; from the AI-quality angle, what counts is that an input filter detects the attempt and trust in the system is preserved.

Guardrails do not replace evaluation, they complement it: evaluation lowers the probability that something goes wrong, guardrails limit the damage when it happens anyway. The aspiration to build AI that is not only capable but also accountable is the values layer that Digital Ethics develops.

The check-and-protect path

Evaluation and guardrails act at different points, but they read as one continuous path: measured once before release, protected at runtime on every request.

flowchart TD
    A["Input"] --> B["Input guardrail<br/>injection, jailbreak"]
    B --> C["Model<br/>plus retrieved context"]
    C --> D["Output guardrail<br/>hallucination, data leak"]
    D --> E["Answer to user"]
    F["Evaluation before release<br/>test sets, LLM-as-judge, feedback"] -.-> B
    F -.-> D
    E -.-> G["Traces and feedback<br/>back into the test sets"]
    G -.-> F

The solid line is the runtime path of each individual request; the dashed lines show how evaluation before release sets the guardrail thresholds and how production traces flow back into the test sets as fresh check material. Two separate measures thus become a closed loop that calibrates better with every iteration.

Eval and guardrails as part of LLMOps

Evaluation and guardrails are not a one-off sign-off but a standing task. Models silently change their behaviour, input data shifts, new attack patterns appear. That is exactly why both belong in ongoing operation, that is in LLMOps and MLOps: evaluation is the gate before every release, guardrails are part of observation, and the traces from production feed the next round. The matching tooling is open source and self-hostable, which counts double here: the telemetry of an AI application, that is the real inputs together with their data, stays in-house instead of flowing to a foreign provider. Platforms for evaluation and LLM-as-judge such as Langfuse and Agenta can be run yourself, and specialised tools for guardrails and red-teaming probe a model for weaknesses before it goes into production.

Where evaluation and guardrails break

  • No test set. A new version goes live because it looked good in a few examples. The regression shows up only at the user, with no one having seen it coming.
  • The judge is uncalibrated. LLM-as-judge is used without calibrating the judge against human verdicts. Then the evaluation measures the judge's bias instead of the quality of the answer.
  • Guardrails only at the exit. Only the output is filtered, not the input. A prompt injection takes effect before the output filter even applies.
  • Theatre instead of measurement. There is a metric, but no one acts on it. An evaluation whose result stops no release is decoration.

Records and inputs kept in-house

Evaluation and guardrails work with the most sensitive data of all, the real inputs and outputs of a production application. Run the tooling self-hosted and these records remain in-house instead of flowing to a foreign provider. The classification of which models are approved on which data with what evidence is the Tech Radar and AI Governance service; the evidence-and-supply-chain side, into which eval records and guardrail checks feed, is covered by Security, Compliance and OSPO. The measurement-and-operations layer into which both are embedded is described by Observability and Telemetry. The security view on prompt injection and attack defence is developed by the Security Strategy, the values layer behind it by Digital Ethics. For Swiss organisations this is also the sovereignty question: at a US provider this stream leaves the country, while self-hosted it stays under Swiss data protection.

References


Related topics

Ask AI

These links open external AI services, the conversation and its content are sent to their providers.