What Changed Between Versions?

An IT helpdesk agent can handle tickets in two ways: conservatively (route to humans) or autonomously (self-resolve). Analytics quantifies the behavioral shift between versions.


The Scenario

An IT helpdesk uses an agent to triage and route support tickets. v1.0.0 is conservative, while v2.0.0 handles more tickets autonomously.

Example ticket: "I can't connect to the VPN from home"

v1.0.0: routes the ticket to IT staff. Strategy: route most tickets to humans (70% routed).
v2.0.0: auto-resolves with step-by-step instructions. Strategy: self-resolve common issues (60% auto-resolved).

The Key Moment: The Trade-Off

v2.0.0 auto-resolves three times as many tickets (20% → 60%), but uses 56% more tokens and is 76% slower.

Analytics surfaces both sides: the improvement (fewer tickets to human staff) and the cost (higher latency and token usage). The assessment is MIXED — you decide if the trade-off is acceptable.

What the Demo Shows

Act 1 — Load

Metrics Aggregation

Aggregate traces into behavioral metrics: latency percentiles, token usage, error rates, and action distributions. See what your agent actually does, not just how fast it runs.
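As a minimal sketch, the aggregation step amounts to reducing a list of trace records into summary statistics. The field names here (latency_ms, tokens, error, action) are illustrative assumptions, not the library's actual trace schema:

```python
from collections import Counter
from statistics import quantiles

def aggregate(traces):
    """Aggregate raw trace records into behavioral metrics.

    Each trace is assumed to be a dict with illustrative keys:
    latency_ms (float), tokens (int), error (bool), action (str).
    """
    latencies = sorted(t["latency_ms"] for t in traces)
    pct = quantiles(latencies, n=100)  # 1st..99th percentiles
    n = len(traces)
    return {
        "p50_latency_ms": pct[49],
        "p95_latency_ms": pct[94],
        "avg_tokens": sum(t["tokens"] for t in traces) / n,
        "error_rate": sum(t["error"] for t in traces) / n,
        # Per-action counts: this is the "what it actually does" view
        "actions": Counter(t["action"] for t in traces),
    }
```

Latency percentiles alone would miss the behavioral story; keeping the action counts alongside them is what lets a later comparison show a shift in strategy, not just speed.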

Act 2 & 3 — Measure

Action Distribution

The real story is in what actions the agent takes. How many tickets get auto-resolved vs routed to humans vs escalated? Action distribution shows the practical impact.
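Turning raw action counts into a distribution is a small reduction. A sketch, with action names mirroring the demo rather than any fixed schema:

```python
from collections import Counter

def action_distribution(actions):
    """Return each action's share of the total, as fractions."""
    counts = Counter(actions)
    total = sum(counts.values())
    return {action: n / total for action, n in counts.items()}

# v1.0.0-style behavior: mostly routed to humans
dist = action_distribution(
    ["route_human"] * 70 + ["auto_resolve"] * 20 + ["escalate"] * 10
)
# → {'route_human': 0.7, 'auto_resolve': 0.2, 'escalate': 0.1}
```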

Act 4 — Compare

Version Comparison

Compare any two versions with significance levels: major (> 25% change), moderate (> 10%), minor (> 5%), or negligible (≤ 5%). The overall assessment is one of improvement, regression, mixed, or neutral.

Insight

Significance Levels

Not all changes matter equally. A 2% latency increase is negligible. A 200% increase in auto-resolution is major. Significance levels help you focus on what actually matters.
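Using the thresholds above, classifying a change is a short cascade. This is an illustrative implementation, not the library's code:

```python
def significance(old, new):
    """Classify the relative change between two metric values."""
    change = abs(new - old) / old  # relative change vs the old value
    if change > 0.25:
        return "major"
    if change > 0.10:
        return "moderate"
    if change > 0.05:
        return "minor"
    return "negligible"

# A 2% latency increase barely registers; tripling auto-resolution does.
significance(100, 102)    # → 'negligible'
significance(0.20, 0.60)  # → 'major'
```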

Try It Yourself

Analytics runs locally with MockPlatform for testing. Connect to LangSmith for production trace data.

pip install llmhq-promptops llmhq-releaseops
releaseops analytics metrics my-agent@1.0.0
releaseops analytics compare \
  my-agent@1.0.0 my-agent@2.0.0