2026 is the year Swiss companies realise that an LLM without observability is a black box that inflates their liability exposure. Every production AI system produces logs 10x to 40x more extensive than those of classic web services — with prompts, tool calls, costs, hallucinations and drift curves that nobody traditionally monitors. According to the AI Engineering Report 2026, 61% of all AI production systems run without structured observability — with consequences ranging from undetected hallucinations to surprise token cost spikes and Art. 12 EU AI Act violations. This guide shows how we at mazdek build 24/7 observability with ARGUS — OpenTelemetry, evals, drift detection, FinOps and governance in a production-grade Swiss stack architecture.
What Is LLM Observability in 2026?
LLM observability is the discipline of gaining structured insights from production prompts, tool calls, responses, evals and costs — in real time, with alerts, drift detection and audit logs. Unlike classic Application Performance Monitoring (APM), LLM observability must handle non-deterministic behaviour: the same input produces different outputs, costs vary by a factor of 3 to 5 per request, and errors are not exceptions but semantic deviations.
The three pillars of modern LLM observability in 2026:
- Tracing: Every LLM call is logged with full input/output attributes, token counts, costs, model, version and session ID. Distributed tracing via W3C Trace Context links nested tool calls and RAG retrieval across multiple services.
- Evaluation (Evals): Automated quality scoring of every output — faithfulness, answer relevance, hallucination rate, toxicity, PII leakage. Without continuous evals, nobody notices the model is slowly drifting.
- FinOps & Governance: Token budgeting per user, team and feature. Granular cost attribution. EU AI Act compliant audit logs. Privacy scrubbing (PII, secrets).
«A production LLM system without observability is like an aeroplane without a black box. You are flying — but when something goes wrong, you have no idea why. In Switzerland, where the FADP, FINMA rules and the EU AI Act apply, this is no longer a technical luxury problem but a compliance risk. At mazdek we operate more than 47 production AI systems in 2026 — each of them with end-to-end tracing, evals and automated alerting through ARGUS.»
— ARGUS, Project Guardian Agent at mazdek
Why LLM Observability Becomes Critical in 2026
Five developments make observability non-negotiable for Swiss companies in 2026:
- Production readiness: In 2024 most AI systems were prototypes. In 2026 they are business-critical. A single hallucination bug costs between CHF 800 and CHF 450,000 depending on the use case — lawyer hours, wrong advice, incorrect invoices.
- EU AI Act in force (Art. 12 logs): Since 2 February 2026, every high-risk AI system must keep complete records of its operation — including model version, input, output, user and timestamp. Without an observability pipeline this is impossible.
- Token cost explosion: With reasoning models (o5, Opus 4.7, Gemini 2.5 Pro), output tokens per request increase by a factor of 5 to 20. A single agentic workflow can run for hours and cost more than CHF 100. Without FinOps controls, surprise six-figure monthly bills follow.
- Model drift: Vendor models change without notice. «gpt-5-turbo» from January 2026 answers slightly differently in April. Without evals and A/B snapshot comparisons, nobody notices — until user complaints escalate.
- Multi-vendor reality: No production system runs on a single model any more. Typically 3 to 5 providers rotate (Claude, GPT, Gemini, Mistral, local Llama models). Observability is the only way to compare quality and cost across providers.
The Modern LLM Observability Stack 2026
The LLMOps tool landscape has consolidated in 2025/2026. At mazdek we recommend the following stack for Swiss deployments:
| Layer | Tool 2026 | Alternative | Role |
|---|---|---|---|
| Tracing layer | Langfuse (self-hosted CH) | Helicone, Arize Phoenix | Prompt/completion log, session tracking |
| Telemetry protocol | OpenTelemetry + GenAI Semantic Conventions | Custom JSON events | Standardised vendor-neutral tracing |
| Evaluation | Ragas + DeepEval + Custom LLM-as-Judge | Braintrust, Promptfoo | Faithfulness, relevance, toxicity, PII |
| Metrics / alerts | Prometheus + Grafana + Loki | VictoriaMetrics, Datadog | SLO dashboards, multi-tier alerts |
| FinOps / cost | Langfuse Spend + OpenMeter | Vantage, Helicone Cost | Token budget, chargeback, forecasting |
| Guardrails | Guardrails AI + NVIDIA NeMo | LLM Guard, Lakera | PII masking, prompt injection blocks |
| Experiment tracking | MLflow / Weights & Biases | Neptune, ClearML | Prompt versioning, A/B comparisons |
| Swiss hosting | Green / Infomaniak / Swisscom | Exoscale, cyon | FADP, FINMA and revFADP compliance |
The critical point for Swiss deployments: every tool listed is available as a self-hosted open-source variant — which becomes mandatory as soon as PII or trade secrets flow through the pipeline. SaaS LLMOps services hosted outside the EU/Switzerland are off-limits for regulated industries.
The 14 Metrics Every Swiss LLM System Must Track
From our work across 47 production AI deployments, we have distilled the following metric catalogue. We cluster the metrics into four tiers:
Performance metrics
- Time to First Token (TTFT): Latency until the first output token. Critical for chat UX. Target: < 800 ms p95.
- Tokens per Second (TPS): Streaming speed. Target: > 60 TPS for user-facing flows.
- End-to-end latency p50/p95/p99: Total time including retrieval, tool calls, re-ranking. Our alerting thresholds: p95 > 2.5 s → warning, p99 > 5 s → critical.
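As a sketch of how these streaming metrics fall out of raw trace data, the helper below derives TTFT and TPS from per-token timestamps. The function name and input shape are our own illustration, not part of any SDK:

```typescript
// Sketch: derive TTFT and TPS from per-token timestamps (ms offsets
// from request start). Names and shapes are illustrative.
interface StreamStats {
  ttftMs: number          // time to first token
  tokensPerSecond: number // streaming speed over the token window
  totalMs: number         // offset of the last token
}

function computeStreamStats(tokenTimestamps: number[]): StreamStats {
  if (tokenTimestamps.length === 0) throw new Error('no tokens streamed')
  const ttftMs = tokenTimestamps[0]
  const totalMs = tokenTimestamps[tokenTimestamps.length - 1]
  // TPS measured from first to last token; guard against zero-width windows
  const windowMs = Math.max(totalMs - ttftMs, 1)
  const tokensPerSecond = ((tokenTimestamps.length - 1) / windowMs) * 1000
  return { ttftMs, tokensPerSecond, totalMs }
}

// Five tokens: first after 400 ms, then one every 20 ms
// -> TTFT 400 ms, 50 tokens/s
computeStreamStats([400, 420, 440, 460, 480])
```

In production these timestamps come from the streaming callback; the p50/p95/p99 aggregation then happens in Prometheus.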
Quality metrics (evals)
- Faithfulness score: Does the output match the context/RAG retrieval factually? Measured with LLM-as-Judge or Ragas. Target: > 0.92.
- Answer relevance: Does the output answer the actual question? Target: > 0.88.
- Hallucination rate: Percentage of answers with factual inventions. Target: < 2.5%. Automated detection via Ragas + custom judge.
- Toxicity score: Share of answers with inappropriate content. Target: < 0.2% (was 1–2% in 2024, dropped massively thanks to guardrails).
Cost metrics (FinOps)
- Cost per Request (CPR): Average CHF cost per API call, split into input/output tokens. Our benchmark: CHF 0.003 for support chats, up to CHF 0.45 for agentic workflows.
- Tokens per feature: Distribution of token costs across features or teams. Basis for chargeback and cost optimisation.
- Cache hit ratio: Share of requests resolved via prompt caching (Anthropic, OpenAI, Gemini). Target: > 45%. Savings: up to 90% on input costs for cached prefixes.
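A minimal cost model makes the CPR and caching arithmetic concrete. The sketch below uses illustrative per-million-token prices, not actual vendor list prices, and assumes cached prefix tokens bill at roughly 10% of the fresh input price as described above:

```typescript
// Sketch: per-request CHF cost with prompt caching. All prices are
// illustrative placeholder assumptions.
interface Pricing {
  inputPerMTokChf: number
  cachedInputPerMTokChf: number
  outputPerMTokChf: number
}

function costPerRequest(
  p: Pricing,
  inputTokens: number,
  cachedTokens: number,
  outputTokens: number,
): number {
  const freshInput = inputTokens - cachedTokens
  return (
    (freshInput * p.inputPerMTokChf +
      cachedTokens * p.cachedInputPerMTokChf +
      outputTokens * p.outputPerMTokChf) / 1_000_000
  )
}

const pricing: Pricing = {
  inputPerMTokChf: 3.0,       // assumption for illustration
  cachedInputPerMTokChf: 0.3, // cached prefixes at ~10% of input price
  outputPerMTokChf: 15.0,
}

// 10k input tokens of which 8k hit the cache, 500 output tokens
costPerRequest(pricing, 10_000, 8_000, 500) // ~ CHF 0.0159
```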
Compliance and governance metrics
- PII leakage rate: Share of answers with non-masked personal data. Target: 0 (blocked immediately on detection).
- Prompt injection detection rate: How many malicious prompts are detected and blocked. Baseline: roughly 0.3% of requests carry injection signatures.
- Audit log coverage: Percentage of inference calls with full Art. 12 EU AI Act logs. Target: 100%. Anything less is a compliance violation.
- Model version drift: Change delta in eval scores between two model snapshots. Alert on > 3% regression.
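The drift alert reduces to a relative-delta check between snapshots. This sketch assumes eval scores are already aggregated per model snapshot; the 3% threshold matches the alerting rule above:

```typescript
// Sketch: relative eval-score regression check between two model
// snapshots. Fires when the score dropped by more than maxRegression
// (default 3%, matching the alert threshold above).
function isDriftRegression(
  baselineScore: number,
  currentScore: number,
  maxRegression = 0.03,
): boolean {
  if (baselineScore <= 0) throw new Error('invalid baseline score')
  return (baselineScore - currentScore) / baselineScore > maxRegression
}

isDriftRegression(0.94, 0.90) // true: ~4.3% regression -> alert
isDriftRegression(0.94, 0.93) // false: ~1.1% is within tolerance
```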
Reference Architecture: ARGUS Observability Stack
Our reference architecture for Swiss deployments consists of six layers. Every mazdek project starts with this template — adapted to the industry (FINMA, revFADP, HIPAA via NINGIZZIDA):
```
+---------------------------------------------------+
| LLM application (Astro + Hono + Svelte + Python)  |
| OTel SDK · traceparent propagation                |
+-------------------------+-------------------------+
                          | OTLP (gRPC / HTTP)
                          v
+-------------------------+-------------------------+
| OpenTelemetry Collector (Swiss-hosted)            |
| GenAI Semantic Conventions · PII scrubber         |
| Redacting processor · Batch exporter              |
+-------+-----------------+-----------------+-------+
        |                 |                 |
        v                 v                 v
+-------+------+  +-------+-------+  +------+-------+
|   Langfuse   |  |  Prometheus   |  |     Loki     |
|   (Traces)   |  |   (Metrics)   |  | (Structured  |
|              |  |               |  |    logs)     |
+-------+------+  +-------+-------+  +------+-------+
        |                 |                 |
        v                 v                 v
+-------+-----------------+-----------------+-------+
| Grafana (SLO + alerts + dashboards)               |
| Alert Manager -> PagerDuty / Slack / WhatsApp     |
+-------------------------+-------------------------+
                          |
             +------------+------------+
             v                         v
+------------+------+      +-----------+---------+
| Ragas + DeepEval  |      |    Guardrails AI    |
| (LLM-as-Judge)    |      |  (PII / injection)  |
+-------------------+      +---------------------+
```
- Layer 1: Application
- Layer 2: OTel Collector
- Layer 3: Storage (Langfuse / Prometheus / Loki)
- Layer 4: Visualisation + alerting (Grafana)
- Layer 5: Evals + guardrails
- Layer 6: Swiss hosting (Green / Infomaniak / Swisscom)
Layer 1: Application with OTel SDK
Every mazdek application instruments LLM calls with OpenTelemetry. The Python/TypeScript/Rust SDKs ship automatic tracing wrappers for Anthropic, OpenAI, Google and local models via ATLAS. The GenAI Semantic Conventions (an OTel standard since 2025) define consistent attributes such as `gen_ai.request.model`, `gen_ai.usage.input_tokens` and `gen_ai.response.finish_reason`.
Layer 2: OpenTelemetry Collector
A central OTel Collector runs Swiss-hosted and receives all OTLP streams. This is where the critical PII scrubbing work happens: regex-based masking of AHV numbers, credit cards, phone numbers, IBANs. The collector normalises, batches and distributes to backend systems. Without this layer, PII inevitably leaks into the observability tools.
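A stripped-down version of such a redaction step, shown here in TypeScript for illustration (the Collector itself would use its redaction processor configuration): the patterns and placeholder tags are assumptions, and a production scrubber needs far more exhaustive rules plus checksum validation.

```typescript
// Sketch of regex-based PII masking for Swiss deployments. Patterns
// and placeholder tags are illustrative assumptions; production rules
// must be far more exhaustive (e.g. validating AHV check digits).
const PII_PATTERNS: Array<[RegExp, string]> = [
  // Swiss AHV number: 756.XXXX.XXXX.XX
  [/\b756\.\d{4}\.\d{4}\.\d{2}\b/g, '[AHV]'],
  // Swiss IBAN: CH + 2 check digits + 17 alphanumerics
  [/\bCH\d{2}[A-Z0-9]{17}\b/g, '[IBAN]'],
  // Swiss phone numbers such as +41 79 123 45 67
  [/\+41[\s\d]{9,13}/g, '[PHONE]'],
  // 16-digit card numbers, grouped or not
  [/\b(?:\d{4}[ -]?){3}\d{4}\b/g, '[CARD]'],
]

function scrubPII(text: string): string {
  // Apply every pattern in order; later patterns see earlier masks
  return PII_PATTERNS.reduce(
    (acc, [pattern, tag]) => acc.replace(pattern, tag),
    text,
  )
}

scrubPII('AHV 756.1234.5678.97, IBAN CH9300762011623852957')
// -> 'AHV [AHV], IBAN [IBAN]'
```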
Layer 3: Storage (traces, metrics, logs)
We rely on three specialised backends: Langfuse for LLM-specific traces with prompt/completion details, Prometheus for numerical time series (p95, cost/request), and Loki for structured logs. All three run on-premise or on Swiss hosting — non-negotiable for regulated industries.
Layer 4: Visualisation + alerting
Grafana is the unified UI — with SLO dashboards (SLI, error budget, burn rate) and multi-tier alerting: warning (Slack), high (PagerDuty), critical (WhatsApp via IRIS). Drift alerts, cost burn-rate alerts and PII leak alerts are all orchestrated here.
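The multi-tier policy can be sketched as a small severity classifier plus a routing table. The latency thresholds follow the SLOs from the metrics section; the channel names and the intermediate «high» tier routing are illustrative assumptions:

```typescript
// Sketch: multi-tier alert classification and routing. Channel names
// and the "high" tier are illustrative assumptions.
type Severity = 'warning' | 'high' | 'critical'

// Which channels each tier notifies, escalating upward
const ROUTES: Record<Severity, string[]> = {
  warning: ['slack'],
  high: ['slack', 'pagerduty'],
  critical: ['slack', 'pagerduty', 'whatsapp'],
}

// Latency SLOs from the metrics section:
// p95 > 2.5 s -> warning, p99 > 5 s -> critical
function classifyLatency(p95Ms: number, p99Ms: number): Severity | null {
  if (p99Ms > 5000) return 'critical'
  if (p95Ms > 2500) return 'warning'
  return null
}

function channelsFor(severity: Severity | null): string[] {
  return severity ? ROUTES[severity] : []
}
```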
Layer 5: Evals + guardrails
Evaluation runs continuously in the background. Every n-th trace (or 100% on high-risk flows) is scored by Ragas (RAG metrics), DeepEval (G-Eval framework) and a dedicated Claude Opus-based judge. Guardrails AI blocks PII leaks and prompt injections in real time.
Layer 6: Swiss hosting
The entire observability pipeline runs in Swiss data centres (Green Geneva, Infomaniak Lausanne, Swisscom Zurich). Our HEPHAESTUS DevOps agent provisions Terraform-coded, ISO 27001 certified infrastructure.
Evaluation: The Art of Measuring Non-Deterministic Behaviour
Evals are the decisive discipline that separates classic observability from LLM observability. An LLM can have 99.9% uptime and still deliver wrong answers at scale. Five eval strategies we use at mazdek:
1. Reference-based evals (with gold standard)
When ground truth is available (for example historical FAQ answers), we measure exact match, BLEU, ROUGE and semantic similarity via embeddings. Best for classification, summarisation and transcription.
2. Reference-free evals (LLM-as-Judge)
A separate LLM (typically Claude Opus 4.7 or GPT-5 Turbo) scores the quality. The standard is the G-Eval framework: criteria such as «faithfulness», «clarity» and «helpfulness» are rated on a 1–5 scale with chain-of-thought prompts. Popular, but to be handled with care — the judge itself can hallucinate.
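To make the judge pattern concrete, here is a minimal sketch of a G-Eval-style prompt builder and score parser. The prompt wording and the `parseJudgeScore` helper are our own illustration, not the official G-Eval implementation:

```typescript
// Sketch: G-Eval-style judge prompt and score parser. Prompt wording
// and parser shape are illustrative assumptions.
function buildJudgePrompt(criterion: string, question: string, answer: string): string {
  return [
    `You are a strict evaluator. Rate the ANSWER for "${criterion}" on a 1-5 scale.`,
    'Think step by step, then output exactly one final line: SCORE: <1-5>',
    `QUESTION: ${question}`,
    `ANSWER: ${answer}`,
  ].join('\n')
}

// Returns null when the judge output is unusable, so the trace can be
// flagged for review instead of silently scored
function parseJudgeScore(judgeOutput: string): number | null {
  const match = judgeOutput.match(/SCORE:\s*([1-5])/)
  return match ? Number(match[1]) : null
}

parseJudgeScore('The answer is grounded in the context.\nSCORE: 4') // -> 4
```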
3. RAG-specific metrics (Ragas)
For RAG systems we use the Ragas framework: faithfulness (output supported by retrieval?), answer relevance (answer fits the question?), context precision (retrieval quality) and context recall (coverage of the factual basis). Every metric as a continuous time series.
4. Human-in-the-loop evals
For critical use cases (medicine via NINGIZZIDA, law, financial advice) human assessment remains indispensable. Langfuse offers scoring UIs where domain experts rate individual traces. Sampling: 1–5% of traces.
5. Adversarial evals (red team)
Our ARES Cybersecurity Agent runs continuous red-team tests: prompt injection, jailbreaks, data exfiltration via indirect prompt injection. The red-team frameworks Promptfoo and Garak repeatedly simulate 1,800+ attack vectors — the results feed into the governance dashboard.
Cost of evals
Evals cost money — every G-Eval scoring consumes tokens. Typical overhead: 15–30% on top of production costs. Our recommendation: 100% evals on high-risk flows, 5–10% sampling on low-risk flows, continuous drift detection on the embedding level.
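The sampling policy itself is a few lines. The sketch below hashes the trace ID deterministically so that repeated runs score the same traces; the FNV-1a hash choice and function shape are assumptions:

```typescript
// Sketch: deterministic eval sampling. High-risk flows are always
// scored; everything else is sampled by hashing the trace ID so the
// decision is reproducible per trace.
function shouldEval(traceId: string, highRisk: boolean, sampleRate = 0.1): boolean {
  if (highRisk) return true // 100% coverage on high-risk flows
  // FNV-1a hash of the trace ID
  let hash = 2166136261
  for (let i = 0; i < traceId.length; i++) {
    hash ^= traceId.charCodeAt(i)
    hash = Math.imul(hash, 16777619)
  }
  // Map the 32-bit hash to [0, 1) and compare against the sample rate
  return (hash >>> 0) / 4294967296 < sampleRate
}
```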
FinOps for LLMs: Keeping Costs Under Control
In our experience with Swiss companies, an average of 38% of 2025 LLM spend was wasted — on poorly designed prompts, missing caching, oversized models for simple tasks and absent budgets. The six most important FinOps levers:
- Model routing: Simple tasks (classification, intent) go to Small Language Models (Mistral Small, Phi-4, Llama-3 8B). Only complex reasoning tasks hit frontier models. Cost reduction: 60–80%.
- Prompt caching: Anthropic, OpenAI and Gemini all support prefix caching in 2026. System prompts, RAG contexts and few-shot examples are tokenised once — subsequent calls pay 10% of the input price. Typical savings: 45–72%.
- Token budgeting: Hard budgets per user/team/feature in CHF per month. OpenMeter and Langfuse provide the metering backend. At 80% burn rate: warning. At 100%: downgrade to a cheaper model instead of blocking.
- Batch inference: For non-interactive workloads (reports, file analysis), use the batch APIs from Anthropic/OpenAI — 50% discount on 24h turnaround. Savings on report pipelines: up to 65%.
- Prompt compression: LLMLingua and similar tools shrink prompts to 30–50% of their original size without quality loss. Critical for repeated multi-step agent workflows.
- Chargeback & showback: Tag every trace with cost centre, user, feature. Monthly chargeback reports per team. Nothing disciplines dev teams faster than internal CHF invoices.
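As a sketch, the model-routing lever boils down to a cheap classifier in front of the API call. The task taxonomy, the 4k-token cutoff and the model IDs below are illustrative assumptions; real routers also factor in latency SLOs and per-tenant budgets:

```typescript
// Sketch: cost-aware model routing. Task taxonomy, token cutoff and
// model IDs are illustrative assumptions.
type Task = 'classification' | 'intent' | 'extraction' | 'reasoning' | 'agentic'

// Tasks that a Small Language Model handles well
const SLM_TASKS: Task[] = ['classification', 'intent', 'extraction']

function routeModel(task: Task, inputTokens: number): string {
  // Simple, short tasks go to the small-model tier
  if (SLM_TASKS.includes(task) && inputTokens < 4000) return 'mistral-small'
  // Complex reasoning and long contexts hit the frontier tier
  return 'claude-opus-4-7'
}

routeModel('intent', 300)    // -> 'mistral-small'
routeModel('reasoning', 300) // -> 'claude-opus-4-7'
routeModel('intent', 12_000) // long context overrides the cheap tier
```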
Governance: EU AI Act Art. 12 in Practice
The EU AI Act has been fully in force since 2 February 2026. Article 12 is the most important one for observability — it requires «automatic recording of events (logs)» for the entire lifespan of every high-risk system. Concrete requirements:
- Mandatory logs: Every inference call must contain date/time, input ID, output ID, model, version, user and result hash.
- Retention: At least 6 months, typically 10 years for regulated industries (FINMA, medicine).
- Immutability: Write-once storage with a cryptographic audit trail is recommended (Merkle tree over log segments).
- Access separation: Operators have access, developers typically only to the masked variant.
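A hash chain is the simplest way to get the recommended tamper evidence. The sketch below links entries with SHA-256 as a simplified stand-in for a Merkle tree over log segments; the field names are illustrative:

```typescript
// Sketch: append-only, tamper-evident audit log for Art. 12-style
// records using a SHA-256 hash chain (a simplified stand-in for the
// Merkle-tree approach). Field names are illustrative assumptions.
import { createHash } from 'node:crypto'

interface AuditEntry {
  timestamp: string
  model: string
  modelVersion: string
  userId: string
  inputHash: string
  outputHash: string
  prevHash: string // links each entry to its predecessor
}

function sha256(data: string): string {
  return createHash('sha256').update(data).digest('hex')
}

function appendEntry(
  chain: AuditEntry[],
  entry: Omit<AuditEntry, 'prevHash'>,
): AuditEntry[] {
  const prevHash =
    chain.length > 0
      ? sha256(JSON.stringify(chain[chain.length - 1]))
      : '0'.repeat(64) // genesis marker
  return [...chain, { ...entry, prevHash }]
}

// Re-deriving every link detects any retroactive edit
function verifyChain(chain: AuditEntry[]): boolean {
  return chain.every((entry, i) =>
    i === 0
      ? entry.prevHash === '0'.repeat(64)
      : entry.prevHash === sha256(JSON.stringify(chain[i - 1])),
  )
}
```

In production the chain would be anchored into write-once storage per segment; verification runs as a scheduled compliance job.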
For Swiss companies, additional layers apply:
- revFADP Art. 7 (data security): TLS 1.3 in transit, AES-256 at rest, role-based access control.
- revFADP Art. 16 (cross-border disclosure): Prohibits exporting logs with PII abroad without adequate protection. Consequence: Langfuse, Prometheus and Loki must be Swiss-hosted as soon as PII is involved.
- FINMA Circ. 2018/3 (outsourcing): Complete traceability of every tool decision for auditors.
- Art. 321 SCC (professional secrecy): Lawyers and doctors may only store logs on FADP-compliant infrastructure.
Our ARES Cybersecurity Agent delivers the governance templates; ARGUS orchestrates continuous compliance.
Observability Platforms in Direct Comparison
| Platform | Open source | Self-hosted | Evals | Swiss fit | When to choose |
|---|---|---|---|---|---|
| Langfuse | Yes (MIT) | Yes | Native | Yes, self-hosted | Standard for mazdek projects |
| Arize Phoenix | Yes (Apache 2) | Yes | Native | Yes, self-hosted | Strong ML drift capabilities |
| Helicone | Yes | Yes | Yes | Possible | Proxy-based integration |
| LangSmith | No | Enterprise only | Yes | Only with EU contract | If LangChain dominates |
| Braintrust | No | No | Strong | Problematic | Mostly US teams |
| Datadog LLM Obs. | No | No | Limited | EU region only | When Datadog is already in the stack |
| OpenLLMetry (OSS) | Yes | Yes | External | Yes | Lightweight OTel integration |
Our standard recommendation for Swiss SMEs and mid-market: Langfuse self-hosted with OTel Collector, Prometheus, Loki and Grafana — all open source, all Swiss-host-fit. For enterprises with an existing Datadog/Dynatrace stack: incremental integration using GenAI Conventions.
Code Sample: Fully Instrumented LLM Call
This is what a fully instrumented LLM call at mazdek looks like — TypeScript with OTel SDK, Langfuse and automatic eval trigger:
```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api'
import { Langfuse } from 'langfuse'
import { Anthropic } from '@anthropic-ai/sdk'

const tracer = trace.getTracer('mazdek-chat', '1.0.0')
const langfuse = new Langfuse({ baseUrl: 'https://langfuse.internal.mazdek.ch' })
const anthropic = new Anthropic()

export async function answerUserQuestion(userId: string, question: string, ragContext: string) {
  return tracer.startActiveSpan('llm.answer_question', async (span) => {
    // GenAI Semantic Convention attributes
    span.setAttributes({
      'gen_ai.system': 'anthropic',
      'gen_ai.request.model': 'claude-opus-4-7',
      'gen_ai.user.id': userId,
      'mazdek.feature': 'customer_chat',
      'mazdek.rag_context_bytes': ragContext.length,
    })
    const lfTrace = langfuse.trace({ name: 'customer_chat', userId })
    try {
      const response = await anthropic.messages.create({
        model: 'claude-opus-4-7',
        max_tokens: 1024,
        system: `You are the mazdek support agent. Answer ONLY based on the context.
Context: ${ragContext}`,
        messages: [{ role: 'user', content: question }],
      })
      // Log tokens and cost attribution
      span.setAttributes({
        'gen_ai.usage.input_tokens': response.usage.input_tokens,
        'gen_ai.usage.output_tokens': response.usage.output_tokens,
        'gen_ai.response.finish_reason': response.stop_reason ?? 'unknown',
      })
      const text = response.content[0].type === 'text' ? response.content[0].text : ''
      // Langfuse generation with full prompt/completion detail
      lfTrace.generation({
        name: 'answer',
        model: 'claude-opus-4-7',
        input: { question, ragContext },
        output: text,
        usage: {
          input: response.usage.input_tokens,
          output: response.usage.output_tokens,
        },
      })
      // Trigger async eval (non-blocking; queueFaithfulnessEval is defined elsewhere)
      queueFaithfulnessEval({
        traceId: lfTrace.id,
        question,
        context: ragContext,
        answer: text,
      })
      span.setStatus({ code: SpanStatusCode.OK })
      return text
    } catch (err) {
      span.recordException(err as Error)
      span.setStatus({ code: SpanStatusCode.ERROR, message: (err as Error).message })
      throw err
    } finally {
      span.end()
    }
  })
}
```
All of this happens automatically from a single instrumented function: traceparent propagation via HTTP headers into RAG and vector DB services, cost attribution via OTel attributes for FinOps dashboards, async evals for faithfulness tracking, and error capture for alerting. Our ATLAS Languages Agent ships equivalent templates for Python (openinference), Rust (opentelemetry-rust) and Go.
Case Study: St. Gallen Insurer Reduces Hallucinations by 71%
A Swiss property insurer (420 employees, CHF 780 million in premium volume) had been running a RAG-based chatbot for claims handling since mid-2025. The problem: users complained about invented contract clauses and wrong deadline information. Internal nickname: «The HalluciBot».
Starting point October 2025
- No observability: only LLM provider dashboards, no prompt/completion logs
- No evals: quality was measured through monthly manual spot checks
- Hallucination rate (measured retroactively): 8.7%
- P95 latency: 4.2 s (timeout complaints)
- Monthly LLM cost: CHF 12,400 — 52% outliers due to failed tool calls in loops
- FINMA supervisory letter Q4 2025: «Traceability of the automated advice insufficient»
The mazdek transformation: 10 weeks, 5 agents
We orchestrated the transformation with:
- ARGUS: Observability architecture, SLO dashboards, alerting. Langfuse self-hosted at Green Geneva, Prometheus, Loki, Grafana.
- PROMETHEUS: Eval framework with Ragas + Claude Opus judge, continuous hallucination scoring.
- ARES: PII scrubber inside the OTel Collector, prompt injection guardrails, FINMA-compliant audit logs with Merkle tree.
- HEPHAESTUS: Terraform-coded infrastructure on Swiss cloud, ISO 27001 pipeline.
- HERACLES: Model routing between Claude Sonnet (simple questions) and Claude Opus (complex claims), prompt caching optimisation.
Results after 14 weeks
| Metric | Before (Oct 2025) | After (Feb 2026) | Improvement |
|---|---|---|---|
| Hallucination rate | 8.7% | 2.5% | -71% |
| Faithfulness score | 0.74 | 0.94 | +27% |
| P95 latency | 4.2 s | 1.6 s | -62% |
| Monthly LLM cost | CHF 12,400 | CHF 5,200 | -58% |
| Cache hit ratio | 0% | 64% | +64% |
| Hallucination detection time | ~11 days | < 90 seconds | -99.9% |
| FINMA supervisory letter Q2 2026 | Objections | No objections | Compliance achieved |
| Mean Time to Resolve (MTTR) | 3.5 h | 18 min | -91% |
| Annual LLM spend savings | — | CHF 86,400 | ROI in 3.7 months |
The decisive turning point did not come from a single trick but from the combination of tracing, evals, model routing and caching. Each individual measure would have produced only a third of the effect.
Implementation Roadmap: From Zero to Observability in 8 Weeks
Our proven 5-phase process for Swiss companies:
Phase 1: Audit & baseline (week 1)
- Inventory: which LLM calls run where, with which models, at what cost?
- Identify critical flows (high-risk tasks: advisory, compliance, healthcare)
- Compliance gap analysis (EU AI Act, FADP, FINMA, industry-specific)
- Risk ranking by ARES
Phase 2: OTel instrumentation (weeks 2–3)
- OTel SDK into every app (TS/Python/Rust/Go)
- Enforce GenAI Semantic Conventions
- Collector deployment with PII scrubber
- Langfuse self-hosted on Swiss hosting via HEPHAESTUS
Phase 3: Dashboards & alerts (weeks 4–5)
- Grafana dashboards for performance, quality, cost, compliance
- SLO definitions: p95 < 2.5 s, faithfulness > 0.92, hallucination < 2.5%
- Multi-tier alerting (Slack / PagerDuty / WhatsApp)
- On-call rotation with playbooks via ARGUS Guardian
Phase 4: Evals & guardrails (weeks 6–7)
- Ragas + DeepEval + custom judge for high-risk flows
- Guardrails AI for PII masking and prompt injection blocks
- Red team integration via ARES with PromptFoo
- Human-in-the-loop scoring for compliance-critical processes
Phase 5: FinOps & continuous optimisation (week 8+)
- Token budgeting per team/feature via OpenMeter
- Implement model routing and prompt caching
- Monthly chargeback reports
- Quarterly red-team audits and policy reviews
The Future: Agentic Observability and Governance Automation
LLM observability in 2026 is only the beginning. What we expect for 2027+:
- Agentic traces: Multi-step agent workflows (10–100+ nested LLM calls) require new visualisations. First products: Langfuse Sessions, Arize Phoenix Agent Traces.
- Self-healing pipelines: ARGUS-like guardians that trigger model rollbacks, prompt optimisations and parameter tuning automatically — see our Self-Repairing AI article.
- Observability MCP: Observability data becomes queryable for AI agents via the Model Context Protocol. «Why were yesterday's costs higher?» → agent accesses Langfuse via MCP.
- EU AI Act certification logs: Standardised log formats that can be transmitted directly to supervisory authorities for Art. 12 compliance.
- Observability as code: Dashboards, alerts and evals as git-versioned Terraform/Pulumi definitions. Part of our Swiss sovereign AI stack.
Conclusion: Observability Is the Difference Between Prototype and Product
The key takeaways for Swiss decision makers in 2026:
- Compliance mandate: Without complete logging and evals, EU AI Act compliance is impossible in 2026. This is not a technical nice-to-have but a legal obligation.
- Quality lever: In our insurance case the hallucination rate dropped by 71% — purely through structured observability. No new model magic, no new prompts.
- Cost lever: 38–58% savings on LLM costs through FinOps practices (model routing, caching, budgeting) — derived directly from observability data.
- Swiss stack imperative: For regulated industries, self-hosted observability (Langfuse, Prometheus, Grafana, Loki) on Swiss hosting is the only FADP-compliant path.
- The time is now: Every day without observability is a day with undetected problems, surprise bills and growing compliance risk.
At mazdek, 19 specialised AI agents orchestrate the entire observability chain: ARGUS for 24/7 monitoring, PROMETHEUS for evals, ARES for guardrails and compliance, HEPHAESTUS for Swiss-hosted infrastructure, HERACLES for model routing and FinOps. More than 47 production AI systems for Swiss companies run on this architecture — revFADP, GDPR, EU AI Act and FINMA compliant from day one.