# ReleaseOps

Release engineering for AI agent behavior — bundle, promote, evaluate, and observe behavior artifacts.
## Why ReleaseOps?

AI agents ship behavior through prompts, policies, and model configurations — not deterministic code. When something breaks in production, there's no `git blame` for "why did the agent start approving refunds it shouldn't?"

ReleaseOps brings standard release engineering to these behavior artifacts, so you always know what's running, what changed, and why.
## Features

- **Bundle Creation**: Compose prompts, policies, and model configs into immutable, content-addressed artifacts verified with SHA-256.
- **Gated Promotion**: Move bundles through dev → staging → prod with configurable quality gates: evaluation, approval, and soak time.
- **Instant Rollback**: Revert to any previous bundle version instantly, with a full audit trail of every promotion and rollback.
- **Automated Evaluation**: Run test suites with pluggable judges — exact match, regex, LLM-as-judge, or composite judges.
- **OpenTelemetry Integration**: Automatically inject bundle metadata into OTel spans for production observability and tracing.
- **Behavior Attribution**: Trace agent behavior back to specific prompt lines and policy rules, with confidence scoring.
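Content addressing, as used by bundle creation above, can be illustrated in a few lines: hash a canonical serialization of the bundle's contents and use the digest as its identity. This is a generic sketch of the idea, not ReleaseOps's actual manifest format:

```python
import hashlib
import json

def content_address(bundle: dict) -> str:
    """Return a SHA-256 digest of the bundle's canonical JSON form.

    Sorting keys and fixing separators makes the serialization
    deterministic, so identical content always yields the same ID.
    """
    canonical = json.dumps(bundle, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

bundle = {
    "prompts": {"system": "onboarding:v1.2.0"},
    "model": {"name": "claude-sonnet-4-5", "temperature": 0.7},
}
digest = content_address(bundle)
# Any change to the content produces a different address, so a stored
# digest doubles as an integrity check when the bundle is loaded.
```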
## Installation

```bash
pip install llmhq-releaseops
```

| Extra | Install | Adds |
|---|---|---|
| `eval` | `pip install llmhq-releaseops[eval]` | LLM-as-judge (OpenAI, Anthropic) |
| `langsmith` | `pip install llmhq-releaseops[langsmith]` | LangSmith trace queries |
| `dev` | `pip install llmhq-releaseops[dev]` | pytest, black, mypy |
## Quick Start

```bash
# Initialize release infrastructure
releaseops init

# Create a bundle from prompts and model config
releaseops bundle create support-agent \
  --artifact system=onboarding:v1.2.0 \
  --model claude-sonnet-4-5 --provider anthropic

# Promote through environments
releaseops promote promote support-agent 1.0.0 dev
releaseops promote promote support-agent 1.0.0 staging
releaseops promote promote support-agent 1.0.0 prod

# Check environment status
releaseops env list

# Compare versions when something changes
releaseops analytics compare support-agent@1.0.0 support-agent@1.1.0
```
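A `name@env` reference like `support-agent@prod` resolves through a per-environment pin: each environment points at one version of a bundle, which is why rollback can be an instant pointer move. A toy sketch of that idea (the mapping below is illustrative, not the library's storage format):

```python
# Each environment pins a version of each bundle; promotion and
# rollback are just updates to this mapping (illustrative sketch).
pins: dict[str, dict[str, str]] = {
    "dev": {"support-agent": "1.1.0"},
    "staging": {"support-agent": "1.0.0"},
    "prod": {"support-agent": "1.0.0"},
}

def resolve(ref: str) -> str:
    """Resolve 'bundle@env' to a concrete version reference."""
    name, env = ref.split("@")
    return f"{name}@{pins[env][name]}"

def rollback(env: str, name: str, version: str) -> None:
    """Point an environment back at a previous version."""
    pins[env][name] = version
```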
## Python SDK

```python
from llmhq_releaseops.runtime import RuntimeLoader

loader = RuntimeLoader()
bundle, metadata = loader.load_bundle("support-agent@prod")

# Access bundle data
model = bundle.model_config.model              # "claude-sonnet-4-5"
temperature = bundle.model_config.temperature  # 0.7
prompts = bundle.prompts                       # Dict[str, ArtifactRef]
policies = bundle.policies                     # Dict[str, ArtifactRef]

# Metadata is auto-injected into OpenTelemetry spans
```
### Async support

```python
from llmhq_releaseops.runtime import AsyncRuntimeLoader

async_loader = AsyncRuntimeLoader()
bundle, metadata = await async_loader.load_bundle("support-agent@prod")
```
## Evaluation Engine
Pluggable judge system for testing agent behavior before promotion. Eval reports gate promotion — a bundle cannot reach prod without a passing report.
### Judge types

Judges are pluggable: exact match, contains, regex, LLM-as-judge, or composite judges that combine several checks.
```bash
# Create and run an eval suite
releaseops eval create support-eval --bundle support-agent@dev
releaseops eval run support-eval
releaseops eval report support-eval  # markdown or JSON output

# Promotion is BLOCKED if no passing eval report exists
releaseops promote promote support-agent 1.1.0 prod
# Error: no passing eval report for support-agent 1.1.0

# Override with --skip-gates (emergency only)
releaseops promote promote support-agent 1.1.0 prod --skip-gates
```
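The gate described above reduces to a simple check: a passing eval report must exist for the exact bundle and version being promoted. A simplified model of that behavior (not the actual implementation; the `reports` structure is hypothetical):

```python
def can_promote(bundle: str, version: str, reports: list[dict],
                skip_gates: bool = False) -> bool:
    """Allow promotion only if a passing eval report exists for this
    exact bundle/version pair, unless gates are explicitly skipped."""
    if skip_gates:
        return True
    return any(
        r["bundle"] == bundle and r["version"] == version and r["passed"]
        for r in reports
    )

# One passing report for 1.0.0, none for 1.1.0
reports = [{"bundle": "support-agent", "version": "1.0.0", "passed": True}]
```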
### Python eval suite

```python
from llmhq_releaseops.models.eval_suite import (
    EvalSuite, EvalCase, Assertion, JudgeType
)

suite = EvalSuite(
    id="support-eval",
    cases=[
        EvalCase(
            id="small-refund",
            input={"amount": "$30", "reason": "item not received"},
            assertions=[
                Assertion(
                    judge=JudgeType.CONTAINS,
                    expected="approved",
                )
            ],
        ),
        EvalCase(
            id="medium-refund",
            input={"amount": "$120", "reason": "changed mind"},
            assertions=[
                Assertion(
                    judge=JudgeType.LLM,
                    expected="agent should not escalate",
                )
            ],
        ),
    ],
)
```
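Conceptually, each judge type reduces to a predicate over the agent's output. The sketch below models exact, contains, regex, and composite judges generically; a real LLM-as-judge would call a model, which the stub only hints at. This is an illustration of the concept, not the library's judge interface:

```python
import re

def judge(kind: str, expected: str, output: str) -> bool:
    """Dispatch to a judge predicate by type (simplified sketch)."""
    if kind == "exact":
        return output == expected
    if kind == "contains":
        return expected in output
    if kind == "regex":
        return re.search(expected, output) is not None
    if kind == "llm":
        # Stub: a real LLM-as-judge would ask a model whether the
        # output satisfies the expectation in `expected`.
        raise NotImplementedError("requires a model call")
    raise ValueError(f"unknown judge type: {kind}")

def composite(checks: list[tuple[str, str]], output: str) -> bool:
    """A composite judge passes only if every sub-judge passes."""
    return all(judge(kind, expected, output) for kind, expected in checks)
```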
## Behavior Attribution
When behavior changes between versions, attribution traces the agent action back to the exact artifact lines that influenced it — prompt lines, policy rules, or model config — with confidence scoring.
```bash
releaseops attribution explain support-agent 1.1.0 \
  --action "approved refund for $120"

# Primary influence: system_prompt (HIGH, confidence 0.87)
# Line 15: "Auto-approve refund requests up to $200"
#
# Secondary: tools_policy (LOW, confidence 0.22)
# Section: refund_tool — no relevant constraint found
#
# Overall assessment: Expected
```
Verdicts:

- **Expected**: behavior matches artifact intent
- **Unexpected**: behavior deviates from artifact intent
- **Contradicts artifacts**: behavior opposes an explicit rule
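One way to build intuition for the confidence scores: rate each artifact line by its lexical overlap with the observed action and report the best match as the primary influence. This is only an illustrative toy; ReleaseOps's analyzer is not claimed to work this way:

```python
def overlap_score(action: str, line: str) -> float:
    """Fraction of the action's words that also appear in the artifact
    line (a crude stand-in for real attribution scoring)."""
    action_words = set(action.lower().split())
    line_words = set(line.lower().split())
    return len(action_words & line_words) / len(action_words)

def primary_influence(action: str, lines: list[str]) -> tuple[str, float]:
    """Return the artifact line with the highest overlap score."""
    return max(((line, overlap_score(action, line)) for line in lines),
               key=lambda pair: pair[1])

policy_lines = [
    "Auto-approve refund requests up to $200",
    "Escalate chargebacks to a human",
]
line, score = primary_influence("approved refund for $120", policy_lines)
```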
### Python API

```python
from llmhq_releaseops.attribution.analyzer import AttributionAnalyzer

analyzer = AttributionAnalyzer(store, prompt_bridge)
attribution = analyzer.analyze(trace_data, "support-agent", "1.1.0")

print(attribution.primary_influence)   # Highest-confidence explanation
print(attribution.overall_assessment)  # "Expected" / "Unexpected" / "Contradicts artifacts"
```

Batch analysis from the CLI:

```bash
releaseops attribution analyze-batch support-agent 1.1.0 --traces traces.json
```
## LangSmith Integration
Query LangSmith traces filtered by ReleaseOps bundle metadata. Aggregate behavioral metrics directly from your LangSmith project.
```bash
pip install llmhq-releaseops[langsmith]
```

```python
import os

from llmhq_releaseops.analytics.platforms.langsmith import LangSmithPlatform
from llmhq_releaseops.analytics import TraceQuerier, MetricsAggregator

platform = LangSmithPlatform(api_key=os.environ["LANGSMITH_API_KEY"])
querier = TraceQuerier(platform)

# Fetch traces tagged with ReleaseOps bundle metadata
traces = querier.query_by_bundle("support-agent", "1.1.0", "prod")
metrics = MetricsAggregator().aggregate(traces, "support-agent", "1.1.0", "prod")
```

CLI equivalent:

```bash
LANGSMITH_API_KEY=... releaseops analytics metrics support-agent@prod
LANGSMITH_API_KEY=... releaseops analytics compare support-agent@1.0.0 support-agent@1.1.0
```
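The aggregation step boils down to ordinary descriptive statistics over the returned traces. A sketch of the kind of rollup `MetricsAggregator` produces, using illustrative field names rather than the library's actual schema:

```python
from statistics import median

def aggregate(traces: list[dict]) -> dict:
    """Roll per-trace records up into summary metrics (illustrative fields)."""
    latencies = [t["latency_ms"] for t in traces]
    errors = sum(1 for t in traces if t["error"])
    return {
        "trace_count": len(traces),
        "error_rate": errors / len(traces),
        "p50_latency_ms": median(latencies),
    }

traces = [
    {"latency_ms": 120, "error": False},
    {"latency_ms": 340, "error": False},
    {"latency_ms": 90, "error": True},
    {"latency_ms": 200, "error": False},
]
metrics = aggregate(traces)
```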
## Requirements
- Python 3.10+
- Git (required for storage)
- Dependencies: Typer, PyYAML, Jinja2, GitPython, OpenTelemetry, llmhq-promptops