ReleaseOps

Release engineering for AI agent behavior — bundle, promote, evaluate, and observe behavior artifacts

Early Access · v0.1.0 · Python 3.10+

Why ReleaseOps?

AI agents ship behavior through prompts, policies, and model configurations — not deterministic code. When something breaks in production, there's no git blame for "why did the agent start approving refunds it shouldn't?"

ReleaseOps brings standard release engineering to these behavior artifacts, so you always know what's running, what changed, and why.

Features

Bundle Creation

Compose prompts, policies, and model configs into immutable, content-addressed artifacts verified with SHA-256.
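Content addressing can be sketched with the standard library: hash a canonical JSON encoding of the manifest, so identical content always yields the same bundle ID and any change yields a new one. The manifest fields below are illustrative, not ReleaseOps' actual schema.

```python
import hashlib
import json

def bundle_digest(manifest: dict) -> str:
    """Content-address a bundle manifest: identical content -> identical ID."""
    # Canonical form: sorted keys, no incidental whitespace differences.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

manifest = {
    "name": "support-agent",
    "prompts": {"system": "onboarding:v1.2.0"},
    "model_config": {"model": "claude-sonnet-4-5", "temperature": 0.7},
}

digest = bundle_digest(manifest)

# Any change, however small, produces a different address.
changed = {**manifest, "model_config": {"model": "claude-sonnet-4-5", "temperature": 0.2}}
assert bundle_digest(changed) != digest
```

Because the digest depends only on content, re-verifying a bundle at load time is a single hash comparison.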

Gated Promotion

Move bundles through dev → staging → prod with configurable quality gates: evaluation, approval, and soak time.

Instant Rollback

Revert to any previous bundle version instantly with a full audit trail of every promotion and rollback.

Automated Evaluation

Run test suites with pluggable judges: exact match, contains, regex, LLM-as-judge, or composite.

OpenTelemetry Integration

Automatically inject bundle metadata into OTel spans for production observability and tracing.
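Conceptually, the injected metadata amounts to a set of attributes on the active span. A minimal sketch of such a payload (the attribute key names are invented for illustration, not ReleaseOps' actual keys):

```python
def bundle_span_attributes(name: str, version: str,
                           env: str, digest: str) -> dict[str, str]:
    """Flatten bundle identity into span attributes (hypothetical key names)."""
    return {
        "releaseops.bundle.name": name,
        "releaseops.bundle.version": version,
        "releaseops.environment": env,
        "releaseops.bundle.digest": digest,
    }

attrs = bundle_span_attributes("support-agent", "1.1.0", "prod", "ab12cd34")
```

With the OpenTelemetry SDK, each key/value pair would be attached via `span.set_attribute(key, value)`, so every production trace carries the exact bundle identity that produced it.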

Behavior Attribution

Trace agent behavior back to specific prompt lines and policy rules with confidence scoring.

Installation

```bash
pip install llmhq-releaseops
```

| Extra | Install | Adds |
| --- | --- | --- |
| `eval` | `pip install llmhq-releaseops[eval]` | LLM-as-judge (OpenAI, Anthropic) |
| `langsmith` | `pip install llmhq-releaseops[langsmith]` | LangSmith trace queries |
| `dev` | `pip install llmhq-releaseops[dev]` | pytest, black, mypy |

Quick Start

```bash
# Initialize release infrastructure
releaseops init

# Create a bundle from prompts and model config
releaseops bundle create support-agent \
  --artifact system=onboarding:v1.2.0 \
  --model claude-sonnet-4-5 --provider anthropic

# Promote through environments
releaseops promote promote support-agent 1.0.0 dev
releaseops promote promote support-agent 1.0.0 staging
releaseops promote promote support-agent 1.0.0 prod

# Check environment status
releaseops env list

# Compare versions when something changes
releaseops analytics compare support-agent@1.0.0 support-agent@1.1.0
```

Python SDK

```python
# app.py
from llmhq_releaseops.runtime import RuntimeLoader

loader = RuntimeLoader()
bundle, metadata = loader.load_bundle("support-agent@prod")

# Access bundle data
model       = bundle.model_config.model        # "claude-sonnet-4-5"
temperature = bundle.model_config.temperature  # 0.7
prompts     = bundle.prompts                   # Dict[str, ArtifactRef]
policies    = bundle.policies                  # Dict[str, ArtifactRef]

# Metadata auto-injected into OpenTelemetry spans
```

Async support

```python
# async_app.py
import asyncio
from llmhq_releaseops.runtime import AsyncRuntimeLoader

async def main():
    async_loader = AsyncRuntimeLoader()
    bundle, metadata = await async_loader.load_bundle("support-agent@prod")

asyncio.run(main())
```

Evaluation Engine

Pluggable judge system for testing agent behavior before promotion. Eval reports gate promotion — a bundle cannot reach prod without a passing report.

Judge types

ExactMatch · Contains · Regex · LLM-as-judge · Composite
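The non-LLM judges reduce to simple predicates, and a composite judge just combines them. A rough sketch with illustrative signatures (an LLM judge would implement the same callable shape):

```python
import re
from typing import Callable

Judge = Callable[[str, str], bool]  # (output, expected) -> passed

def exact_match(output: str, expected: str) -> bool:
    return output.strip() == expected.strip()

def contains(output: str, expected: str) -> bool:
    return expected.lower() in output.lower()

def regex(output: str, expected: str) -> bool:
    return re.search(expected, output) is not None

def composite(*judges: Judge) -> Judge:
    """All sub-judges must pass for the assertion to pass."""
    def run(output: str, expected: str) -> bool:
        return all(judge(output, expected) for judge in judges)
    return run

output = "Refund approved: $30 will be returned within 5 days."
assert contains(output, "approved")
assert regex(output, r"\$\d+")
```

Treating each judge as a plain `(output, expected) -> bool` callable is what makes the system pluggable: a new judge type only needs to satisfy that shape.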
```bash
# Create and run an eval suite
releaseops eval create support-eval --bundle support-agent@dev
releaseops eval run support-eval
releaseops eval report support-eval          # markdown or JSON output

# Promotion is BLOCKED if no passing eval report exists
releaseops promote promote support-agent 1.1.0 prod
# Error: no passing eval report for support-agent 1.1.0

# Override with --skip-gates (emergency only)
releaseops promote promote support-agent 1.1.0 prod --skip-gates
```

Python eval suite

```python
# eval_suite.py
from llmhq_releaseops.models.eval_suite import (
    EvalSuite, EvalCase, Assertion, JudgeType
)

suite = EvalSuite(
    id="support-eval",
    cases=[
        EvalCase(
            id="small-refund",
            input={"amount": "$30", "reason": "item not received"},
            assertions=[
                Assertion(
                    judge=JudgeType.CONTAINS,
                    expected="approved"
                )
            ]
        ),
        EvalCase(
            id="medium-refund",
            input={"amount": "$120", "reason": "changed mind"},
            assertions=[
                Assertion(
                    judge=JudgeType.LLM,
                    expected="agent should not escalate"
                )
            ]
        ),
    ]
)
```

Behavior Attribution

When behavior changes between versions, attribution traces the agent action back to the exact artifact lines that influenced it — prompt lines, policy rules, or model config — with confidence scoring.
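As a toy model of confidence scoring (the real analyzer is certainly richer), one can rank artifact lines by token overlap with the observed action and report the best match with its score:

```python
def token_overlap(action: str, line: str) -> float:
    """Crude confidence: fraction of action tokens present in the artifact line."""
    action_tokens = set(action.lower().split())
    line_tokens = set(line.lower().split())
    return len(action_tokens & line_tokens) / max(len(action_tokens), 1)

def explain(action: str, artifact_lines: dict[int, str]) -> tuple[int, str, float]:
    """Return the (line number, text, confidence) that best explains the action."""
    best = max(artifact_lines.items(), key=lambda kv: token_overlap(action, kv[1]))
    lineno, text = best
    return lineno, text, round(token_overlap(action, text), 2)

# Hypothetical numbered prompt lines, mirroring the example output below.
prompt = {
    12: "Greet the customer by name.",
    15: "Auto-approve refund requests up to $200",
    18: "Escalate chargebacks to a human agent.",
}
lineno, text, confidence = explain("approved refund for $120", prompt)
assert lineno == 15
```

The point of the sketch is the output shape: attribution names a specific artifact line, not just an artifact, which is what makes "why did the agent do that?" answerable.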

```bash
releaseops attribution explain support-agent 1.1.0 \
  --action "approved refund for $120"

# Primary influence: system_prompt  (HIGH, confidence 0.87)
#   Line 15: "Auto-approve refund requests up to $200"
#
# Secondary: tools_policy  (LOW, confidence 0.22)
#   Section: refund_tool — no relevant constraint found
#
# Overall assessment: Expected
```

Verdicts: Expected — behavior matches artifact intent · Unexpected — behavior deviates · Contradicts artifacts — behavior opposes an explicit rule.

Python API

```python
# attribution.py
from llmhq_releaseops.attribution.analyzer import AttributionAnalyzer

analyzer = AttributionAnalyzer(store, prompt_bridge)
attribution = analyzer.analyze(trace_data, "support-agent", "1.1.0")

print(attribution.primary_influence)   # Highest-confidence explanation
print(attribution.overall_assessment)  # "Expected" / "Unexpected" / "Contradicts artifacts"
```

Batch analysis runs from the CLI:

```bash
releaseops attribution analyze-batch support-agent 1.1.0 --traces traces.json
```

LangSmith Integration

Query LangSmith traces filtered by ReleaseOps bundle metadata. Aggregate behavioral metrics directly from your LangSmith project.

```bash
pip install llmhq-releaseops[langsmith]
```

```python
# langsmith_analytics.py
import os
from llmhq_releaseops.analytics.platforms.langsmith import LangSmithPlatform
from llmhq_releaseops.analytics import TraceQuerier, MetricsAggregator

platform = LangSmithPlatform(api_key=os.environ["LANGSMITH_API_KEY"])
querier = TraceQuerier(platform)

# Fetch traces tagged with ReleaseOps bundle metadata
traces = querier.query_by_bundle("support-agent", "1.1.0", "prod")
metrics = MetricsAggregator().aggregate(traces, "support-agent", "1.1.0", "prod")
```

CLI equivalent:

```bash
LANGSMITH_API_KEY=... releaseops analytics metrics support-agent@prod
LANGSMITH_API_KEY=... releaseops analytics compare support-agent@1.0.0 support-agent@1.1.0
```
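What the aggregation step computes can be sketched over plain trace dicts. The field names below are invented for illustration; the real `MetricsAggregator` operates on LangSmith runs:

```python
from statistics import mean, quantiles

def aggregate(traces: list[dict]) -> dict:
    """Summarize traces already filtered to one bundle@environment."""
    latencies = sorted(t["latency_ms"] for t in traces)
    errors = sum(1 for t in traces if t["error"])
    return {
        "count": len(traces),
        "error_rate": errors / len(traces),
        "latency_p50_ms": quantiles(latencies, n=100)[49],  # median cut point
        "latency_mean_ms": mean(latencies),
    }

traces = [
    {"latency_ms": 420, "error": False},
    {"latency_ms": 480, "error": False},
    {"latency_ms": 1900, "error": True},
    {"latency_ms": 510, "error": False},
]
summary = aggregate(traces)
assert summary["error_rate"] == 0.25
```

Computing the same summary for two bundle versions and diffing the numbers is essentially what a version comparison reports.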

Key concepts

Bundle
Immutable, content-addressed manifest of prompts + policies + model config (SHA-256 verified)
Environment
Named deployment target (dev/staging/prod) with a pinned bundle version
Promotion
Moving a bundle through environments with optional quality gates (eval, approval, soak)
Telemetry
Automatic injection of bundle metadata into OpenTelemetry spans
Attribution
Trace agent behavior back to specific prompt lines and policy rules
Analytics
Aggregate behavioral metrics and compare versions to quantify behavioral shifts

Requirements

  • Python 3.10+
  • Git (required for storage)
  • Dependencies: Typer, PyYAML, Jinja2, GitPython, OpenTelemetry, llmhq-promptops

See it in action

Watch ReleaseOps bundle prompts, promote through environments, and trace behavioral changes.