LLMOps and MLOps: AI in production must be measurable, repeatable and auditable
AI in production is an operational discipline, not a one-off trained model: LLMOps and MLOps turn the successful prototype into a repeatable, measurable and evidenced service.
The demo effect deceives. A model that convinces in a notebook is still far from a service a team can rely on. The moment AI runs in production, it raises the same questions as any other software: which version is running right now, how good is it really, what does it cost, and when does it start to drift? MLOps answered these questions for classic machine-learning models. LLMOps carries the answer over to language models and adds what is new about large models. This page describes how the two differ, what the lifecycle looks like, and why the right tooling should be self-hostable.
Why LLMOps is not the same as MLOps
MLOps is the mature discipline: datasets, training code and models are versioned, tested, shipped through automation and watched for model drift in operation. Its core principles (versioning, testing, automation, monitoring and reproducibility) apply unchanged to language models. Three properties of large models shift the centre of gravity, however:
- Non-deterministic. The same input can produce two different answers. A fixed expected value is not enough; it takes an evaluation that judges quality statistically across many cases.
- Prompt-as-code. The behaviour often sits not in the model weights but in the prompt and the retrieved context. That makes the prompt a versioned artefact with its own lifecycle, a subject covered in more depth under prompt and context engineering.
- Third-party inference. A model bought via API is a black box: its internals are unknown, and every call incurs a cost. Cost, latency and token usage become first-order operational metrics.
Classic MLOps trains and monitors a model the organisation owns. LLMOps orchestrates and monitors a model that is often only called, and turns the prompt, the evaluation and the cost into the real object of control.
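The statistical evaluation that non-determinism calls for can be sketched as a pass rate over repeated samples. This is a minimal illustration, not a prescribed method: `Case`, `pass_rate`, `make_flaky_model` and the 0.85 threshold are hypothetical names and values, and the "model" is a seeded stand-in rather than a real API call.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    prompt: str
    check: Callable[[str], bool]  # returns True if an answer is acceptable


def pass_rate(model: Callable[[str], str], cases: list[Case], samples: int = 20) -> float:
    """Fraction of acceptable answers over all cases and repeated samples."""
    passed = 0
    total = 0
    for case in cases:
        for _ in range(samples):
            passed += case.check(model(case.prompt))
            total += 1
    return passed / total


def make_flaky_model(error_rate: float, seed: int) -> Callable[[str], str]:
    """Hypothetical stand-in for a real model: answers wrongly at a fixed rate."""
    rng = random.Random(seed)

    def model(prompt: str) -> str:
        return "unknown" if rng.random() < error_rate else "Bern"

    return model


cases = [Case("What is the capital of Switzerland?", lambda a: "Bern" in a)]
rate = pass_rate(make_flaky_model(error_rate=0.1, seed=0), cases, samples=200)
release_ok = rate >= 0.85  # hypothetical release threshold, not a recommendation
```

The point of the design is that a single wrong answer no longer fails the test; the release decision rests on the measured rate across many samples.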
The LLMOps lifecycle
LLMOps is a loop, not a project with an end date. Every prompt change, every model swap and every new use case runs through the same sequence, which binds four steps into a closed cycle:
```mermaid
flowchart TD
    A["Version<br/>prompt, model, context"] --> B["Evaluate<br/>test sets, LLM-as-judge"]
    B --> C["Ship<br/>controlled, with a fallback path"]
    C --> D["Observe<br/>cost, latency, drift, traces"]
    D --> A
```
The loop is what separates a one-off hit from a service. Versioning records which prompt belongs with which model and which context. Evaluation checks before every release whether the new version beats the old one, instead of trusting a gut feeling from three examples; it is also the bridge to AI evaluation and guardrails. Shipping follows the same patterns as CI/CD, with a controlled rollout and a fallback path. Observation, finally, surfaces in operation what a trace captures per request: the prompt sent, the answer, the token usage and the latency. This telemetry is AI-specific observability, and the open standard for it, the generative-AI conventions of OpenTelemetry, makes sure it slots into an existing monitoring landscape.
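As a rough sketch of how versioning, the fallback path and the per-request trace fit together, the following uses only the standard library. The function names, the `app.*` keys and the stub clients are hypothetical. The `gen_ai.*` keys follow the attribute names of the OpenTelemetry generative-AI conventions as understood at the time of writing; those conventions are still evolving and should be checked against the current specification.

```python
import hashlib
import time
from typing import Callable

# A model client here is any callable returning (answer, input_tokens, output_tokens).
ModelClient = Callable[[str], tuple[str, int, int]]


def release_fingerprint(prompt_template: str, model: str, context_version: str) -> str:
    """Short stable id tying a prompt to the model and context it was evaluated with."""
    payload = "\n".join([prompt_template, model, context_version]).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def call_with_fallback(prompt: str, primary: ModelClient, fallback: ModelClient,
                       primary_model: str, fallback_model: str,
                       release_id: str) -> tuple[str, dict]:
    """Call the primary model, fall back on any error, return answer plus trace."""
    start = time.perf_counter()
    used_model, used_fallback = primary_model, False
    try:
        answer, tokens_in, tokens_out = primary(prompt)
    except Exception:
        used_model, used_fallback = fallback_model, True
        answer, tokens_in, tokens_out = fallback(prompt)
    trace = {
        "gen_ai.request.model": used_model,
        "gen_ai.usage.input_tokens": tokens_in,
        "gen_ai.usage.output_tokens": tokens_out,
        "app.release_id": release_id,        # hypothetical application-level keys
        "app.used_fallback": used_fallback,
        "app.latency_ms": (time.perf_counter() - start) * 1000,
        "app.prompt": prompt,
        "app.answer": answer,
    }
    return answer, trace


def broken_primary(prompt: str) -> tuple[str, int, int]:
    raise TimeoutError("simulated provider outage")


def local_fallback(prompt: str) -> tuple[str, int, int]:
    return "Bern", len(prompt.split()), 1  # whitespace word count as a crude token proxy


release_id = release_fingerprint("Answer briefly: {question}", "primary-model", "kb-v1")
answer, trace = call_with_fallback("Capital of Switzerland?", broken_primary,
                                   local_fallback, "primary-model", "fallback-model",
                                   release_id)
```

Because the fingerprint changes whenever the prompt, the model or the context version changes, every trace can be attributed to exactly one evaluated release.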
Self-hostable tooling
The tool landscape is the most volatile layer of this page. What matters is not the individual name but that the dependable tools are open source and self-hostable. That is exactly what decides data ownership: the telemetry of an AI application, meaning the real user requests including their data, stays in-house instead of flowing to a third-party provider.
- Langfuse bundles tracing, prompt management and evaluation in an open-source platform and speaks OpenTelemetry.
- Agenta combines prompt versioning, evaluation including LLM-as-judge and observation under an MIT licence.
- LiteLLM provides a self-hostable gateway with a unified interface to many models, with virtual keys and cost tracking.
- PostHog AI Observability measures cost, latency and traces in product behaviour and is open source and self-hostable.
Which tool fits depends on the use case; the shared property is data residency. Self-hosting the telemetry keeps control where data governance demands it, and makes AI operations compatible with Swiss data protection.
Where LLMOps breaks
- Prompt sprawl. Prompts live scattered through the code and no one knows which version is in production. Without versioning, a regression can neither be explained nor rolled back.
- No evaluation. A new version goes live without a test set because it looked good in three examples. The regression surfaces only once users run into it.
- Blind operation. Cost, latency and quality are not measured. The first sign of a problem is the bill or the complaint.
- Drift unnoticed. A bought-in model quietly changes its behaviour, or the input data shifts. Without observation, the creeping decay stays invisible.
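The last failure mode, unnoticed drift, can be made visible with a rolling comparison against a baseline. The sketch below is one simple way to do it, under stated assumptions: `DriftMonitor`, its parameters and the score values are hypothetical, and the scores are assumed to come from an evaluation (for example LLM-as-judge) run on sampled production traces.

```python
from collections import deque
from statistics import mean


class DriftMonitor:
    """Flags drift when the rolling mean score falls below baseline minus tolerance."""

    def __init__(self, baseline: float, tolerance: float = 0.05, window: int = 50):
        self.baseline = baseline
        self.tolerance = tolerance
        self.scores: deque[float] = deque(maxlen=window)

    def record(self, score: float) -> bool:
        """Add one quality score in [0, 1]; return True if drift is suspected."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough data yet for a stable rolling mean
        return mean(self.scores) < self.baseline - self.tolerance


monitor = DriftMonitor(baseline=0.90, tolerance=0.05, window=5)
healthy = [monitor.record(s) for s in [0.90, 0.92, 0.88, 0.91, 0.90]]
degraded = [monitor.record(s) for s in [0.70, 0.70, 0.70, 0.70, 0.70]]
```

A rolling window trades reaction time for stability: a single bad answer does not raise an alarm, but a sustained drop does.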
Telemetry kept in-house
A production AI application continuously processes real input, and that telemetry is among the most sensitive data streams an organisation holds. Run the observability tooling self-hosted and this stream stays in-house instead of flowing to a third-party provider. The discipline of making AI operations measurable and evidenced is the Observability and Telemetry service; the evidenced AI on in-house infrastructure that this tooling is embedded into is covered by the Sovereign RAG architectures competence. The delivery and platform side that LLMOps builds on is described by Platform Engineering; the AI-specific development view is covered by AI development. The measurement and control layer that embeds LLMOps into oversight is part of AI governance. For Swiss organisations this is also the sovereignty question: with a US provider the telemetry stream leaves the country, while self-hosted it stays under Swiss data protection.
References
- Langfuse LLM Observability and Application Tracing. Open-source platform for tracing, prompt management and evaluation, self-hostable and OpenTelemetry-compatible. (2026). langfuse.com/docs/observability/overview
- Agenta Open-source LLMOps platform. Prompt versioning, evaluation with LLM-as-judge and observation under an MIT licence, self-hostable. (2026). github.com/Agenta-AI/agenta
- LiteLLM LLM Gateway and unified interface. Self-hostable gateway with a unified interface to many models, virtual keys and cost tracking. (2026). docs.litellm.ai/docs/
- PostHog AI Observability. Observation of cost, latency and traces for AI products, open source and self-hostable under an MIT licence. (2026). posthog.com/ai-observability
- OpenTelemetry Semantic Conventions for Generative AI. The open standard for traces, spans and attributes of AI systems, vendor-neutral. (2024). github.com/open-telemetry/semantic-conventions-genai
- ml-ops.org MLOps Principles. The core principles of production machine learning: versioning, testing, automation, monitoring and reproducibility. (2020). ml-ops.org/content/mlops-principles
Related topics
- Observability, the AI-specific telemetry the observation builds on.
- CI/CD, the delivery patterns LLMOps follows.
- AI development, the development view before operation.
- Data Governance, the requirement on the telemetry's data residency.
- Observability and Telemetry, the commercial service counterpart.