v0.1.1  ·  Open Source  ·  MIT License

pytest for AI.

Evaluate LLM outputs and AI agent behavior with a zero-dependency Python framework. Neutral, extensible, and not owned by any AI company.

Get Started → View on GitHub
$ pip install rubric-eval
Zero required dependencies
Native pytest integration
First-class agent trace eval
Works with any LLM
Local HTML reports
Import from LangFuse & LangSmith
Why Rubric

Built for developers.
Not for any AI company.

After Promptfoo joined OpenAI, the community needed a truly independent evaluation framework. Rubric is that framework — open source forever, no cloud required, no lock-in.

01

First-Class Agent Evaluation

Most eval frameworks only check the final output. Rubric evaluates the entire agent run — which tools were called, in what order, what the reasoning trace looks like, how long it took, and what it cost. You can require specific tools, forbid others, and penalize loops or redundant calls.

02

Works With Any LLM

Rubric is model-agnostic. Pass any Python callable as your judge function: OpenAI, Anthropic, Ollama, a local model, or a mock. No API keys required unless you use LLMJudge or GEval — Rubric auto-detects from your environment variables when you do.
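A judge is just a callable that takes text and returns a score. The exact signature Rubric expects is an assumption here; this mock shows the shape, and a real judge would wrap an OpenAI, Anthropic, or Ollama client in the same way:

```python
# Illustrative mock judge; swap the body for a real LLM client call.
# The (prompt: str) -> float shape is an assumption about Rubric's API.
def mock_judge(prompt: str) -> float:
    """Score 1.0 when the text under evaluation mentions Paris."""
    return 1.0 if "Paris" in prompt else 0.0
```

Because the judge is plain Python, the same test suite can swap between a cheap mock in CI and a real model locally.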

03

Zero Required Dependencies

The entire core — string matching, agent metrics, results, CLI, and HTML reports — ships with nothing mandatory. Install extras only when you need them: pip install "rubric-eval[semantic]" for embeddings (the quotes keep zsh from treating the brackets as a glob), [openai] or [anthropic] for LLM judging.

04

Native pytest Integration

Drop the rubric_eval fixture into any test file. Your LLM evals run inside the same pytest session as your unit tests — same CI pipeline, same runner, same output. No separate eval server or dashboard login needed.

05

Interactive HTML Reports

Every eval run can produce a self-contained HTML report — no server, no build step, just a file. Filter results by pass/fail, drill into per-metric score breakdowns with explanations, and see exact inputs and outputs side-by-side. Share it as a single file.

06

Truly Independent

MIT licensed. Not a product of OpenAI, Anthropic, Google, or any model provider. Rubric will never have a financial incentive to favor one model over another. Evaluate any model with the same unbiased framework — that's the whole point.

07

Zero-Friction Capture

Use rubric.capture() as a context manager or @rubric.track as a decorator to record every LLM call automatically — no manual TestCase construction needed. Captures input, output, latency, and context in one shot, then evaluate the whole session with a single call.

08

Import from LangFuse & LangSmith

Already using an observability platform? Import your existing traces directly into Rubric and run evals without touching your app. load_langfuse("traces.json") and load_langsmith("runs.json") convert exported traces into TestCase / AgentTestCase objects ready for evaluation.

What Makes Rubric Different

The metrics no one else ships.

These are the metrics that matter when you're deploying a real agent — not just a chatbot. Rubric ships them out of the box.

Tool Usage
Tool Call Accuracy & Efficiency
Verify not just what your agent did, but how well it did it. Two separate metrics cover correctness and quality of tool use.
ToolCallAccuracy — assert that every required tool was called, no forbidden tools were used, and optionally that they were called in the correct order. Score degrades proportionally to missing or unexpected tools.
ToolCallEfficiency — detect redundant calls (same tool, same args, called twice), failed tool invocations, and individual tools that exceeded a latency budget. Produces a composite efficiency score.
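The proportional-degradation idea behind ToolCallAccuracy can be sketched in a few lines. This is an illustration of the scoring intuition, not Rubric's actual formula:

```python
def tool_call_accuracy(called, required, forbidden=()):
    """Illustrative score: each missing required tool or forbidden
    call removes an equal share of the score. Not the shipped formula."""
    missing = [t for t in required if t not in called]
    illegal = [t for t in called if t in forbidden]
    violations = len(missing) + len(illegal)
    total = len(required) + len(illegal)
    return 1.0 if total == 0 else max(0.0, 1.0 - violations / total)
```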
Safety
Safety Compliance
Scan every agent output and tool call for real-world safety violations before they reach users.
PII detection — flags Social Security numbers, credit card patterns, email addresses, and phone numbers leaking through responses.
Dangerous SQL — catches DROP, DELETE, TRUNCATE, and other destructive patterns in tool arguments before they execute.
Forbidden tool enforcement — fails the test if any tool on your deny-list was invoked, regardless of what the agent was instructed to do.
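The kind of scan involved looks roughly like this. These regexes are deliberately simplified illustrations; the built-in SafetyCompliance checks are more thorough:

```python
import re

# Illustrative patterns only, not Rubric's implementation.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[A-Za-z]{2,}\b"),
}
DANGEROUS_SQL = re.compile(r"\b(DROP|DELETE|TRUNCATE)\b", re.IGNORECASE)

def scan_output(text):
    """Return the name of every violation found in an output string."""
    hits = [kind for kind, pat in PII_PATTERNS.items() if pat.search(text)]
    if DANGEROUS_SQL.search(text):
        hits.append("dangerous_sql")
    return hits
```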
Reasoning
Reasoning & Trace Quality
Evaluate the quality of the agent's thinking, not just its final answer. Two metrics cover different angles of the same problem.
TraceQuality — analyzes the full reasoning trace for circular loops, repeated steps, and dead-end paths. Penalizes agents that cycle through the same actions without making progress.
ReasoningQuality — measures the ratio of reasoning steps to tool calls, and checks whether the agent updated its plan based on observations — a sign of genuine multi-step thinking.
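The loop penalty can be approximated with a sliding-window check over the trace. A simplified sketch of the idea, not the actual TraceQuality implementation:

```python
def count_loops(trace, window=2):
    """Count positions where the most recent `window` steps exactly
    repeat the `window` steps before them. Illustrative only."""
    loops = 0
    for i in range(window, len(trace) - window + 1):
        if trace[i:i + window] == trace[i - window:i]:
            loops += 1
    return loops
```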
RAG
Context Utilization & Hallucination Detection
For RAG pipelines: verify the agent actually used what it retrieved, and catch claims that aren't grounded in the source context.
ContextUtilization — checks that the final answer actually draws on the retrieved context. Catches the common failure mode where an agent fetches documents but generates a response that doesn't reference them at all.
HallucinationScore — measures faithfulness of the output to the provided context. Two modes: LLM judge (returns a list of hallucinated claims) or local NLI model (no API key required). Score of 1.0 = fully grounded; 0.0 = fully hallucinated.
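The grounding intuition can be shown with a crude lexical version. The real metrics use embeddings, an LLM judge, or an NLI model, but the question they answer is the same:

```python
def context_overlap(answer: str, context: str) -> float:
    """Crude lexical grounding score: the fraction of answer tokens
    that also appear in the retrieved context. Illustration only."""
    answer_tokens = set(answer.lower().split())
    context_tokens = set(context.lower().split())
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & context_tokens) / len(answer_tokens)
```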
Quick Start

Up and running
in 5 minutes.

Install, define test cases, apply metrics, read the report.

eval.py
import rubriceval as rubric

# Replace with your real LLM call
def call_llm(prompt): ...

# Per-test metrics live on the TestCase.
# Shared metrics on evaluate() apply to all.
report = rubric.evaluate(
    test_cases=[
        rubric.TestCase(
            name="Pricing inquiry",
            input="What are the pricing plans?",
            actual_output=call_llm("What are the pricing plans?"),
            metrics=[rubric.Contains(["$29", "$99", "trial"])],
        ),
        rubric.TestCase(
            name="Cancellation flow",
            input="How do I cancel my subscription?",
            actual_output=call_llm("How do I cancel my subscription?"),
            metrics=[rubric.Contains(["Settings", "Billing", "export"])],
        ),
    ],
    metrics=[rubric.NotContains(["I don't know", "I'm not sure"])],
    output_html="report.html",
)
test_llm.py — pytest
import rubriceval as rubric

def test_agent_books_flight(rubric_eval):
    rubric_eval.add_case(
        rubric.AgentTestCase(
            input="Book a flight to Tokyo",
            actual_output=agent.run("Book a flight to Tokyo"),
            expected_tools=["search_flights", "book_flight"],
            tool_calls=agent.tool_calls,
        ),
        metrics=[
            rubric.ToolCallAccuracy(),
            rubric.LatencyMetric(max_ms=5000),
        ],
    )
    # auto-asserts on teardown
capture.py — zero-friction recording
import rubriceval as rubric

# Option A: context manager — explicit recording
with rubric.capture() as session:
    answer = my_llm("Where is the Eiffel Tower?")
    session.record(
        input="Where is the Eiffel Tower?",
        actual_output=answer,
        context="The Eiffel Tower is in Paris, built in 1889.",
    )

report = session.evaluate(metrics=[
    rubric.HallucinationScore(judge_fn=my_judge),
])

# Option B: decorator — capture every call automatically
@rubric.track
def ask(prompt, context=None):
    return my_llm(prompt)

ask("Who wrote Hamlet?")
ask("What is the capital of Egypt?")

report = rubric.get_session().evaluate(
    metrics=[rubric.Contains("Cairo")]
)
rubric.reset_session()
importers.py — LangFuse & LangSmith
from rubriceval.integrations.loaders import (
    load_langfuse, load_langsmith
)
import rubriceval as rubric

# Export traces from LangFuse UI → Traces → Export JSON
test_cases = load_langfuse("langfuse_export.json")

# Or from LangSmith: Project → Runs → Export JSON
test_cases = load_langsmith("langsmith_runs.json")

# Run evals on your existing production traces
report = rubric.evaluate(
    test_cases=test_cases,
    metrics=[
        rubric.HallucinationScore(judge_fn=my_judge),
        rubric.ToolCallAccuracy(),
        rubric.LatencyMetric(max_ms=5000),
    ],
    output_html="trace_report.html",
)
report.print_summary()
agent_eval.py — full agent eval
import rubriceval as rubric

# Pass what your agent actually did — tool calls, trace, latency
results = rubric.evaluate(
    test_cases=[
        rubric.AgentTestCase(
            name="Order inquiry",
            input="Where is my order #ORD-9821?",
            actual_output=agent.run("Where is my order #ORD-9821?"),
            expected_tools=["lookup_order", "create_ticket"],
            tool_calls=agent.tool_calls,
            trace=agent.trace,
            latency_ms=agent.latency_ms,
        ),
        rubric.AgentTestCase(
            name="Urgent — account locked",
            input="My account is locked, this is urgent.",
            actual_output=agent.run("My account is locked, this is urgent."),
            expected_tools=["create_ticket"],
            forbidden_tools=["send_email"],
            tool_calls=agent.tool_calls,
            trace=agent.trace,
            latency_ms=agent.latency_ms,
        ),
    ],
    metrics=[
        rubric.ToolCallAccuracy(check_order=False),
        rubric.TraceQuality(penalize_loops=True),
        rubric.TaskCompletion(),
        rubric.LatencyMetric(max_ms=3000),
    ],
    output_html="report.html",
)
Output

Every run produces a report
your team can actually use.

A single self-contained HTML file. No server. No login. Filter by pass/fail, drill into agent traces, inspect tool calls.

agent_report.html
Metrics Library

18 metrics across 4 categories.

Mix and match. Extend with your own by subclassing BaseMetric.
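A custom metric can be as small as a class with a measure method. The interface below is a guess at the shape — the real BaseMetric API may differ — with a local stand-in so the sketch runs on its own:

```python
# Stand-in for rubric's BaseMetric; treat the interface as an assumption.
class BaseMetric:
    name = "base"

    def measure(self, output: str) -> float:
        raise NotImplementedError

class WordBudget(BaseMetric):
    """Full credit under a word budget, proportional fall-off above it."""
    name = "word_budget"

    def __init__(self, max_words: int = 50):
        self.max_words = max_words

    def measure(self, output: str) -> float:
        n = len(output.split())
        return 1.0 if n <= self.max_words else self.max_words / n
```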

String Matching No deps
ExactMatch
Binary exact string comparison with optional case-insensitive mode. Returns 1.0 or 0.0. Best for structured outputs with a known correct answer.
Contains
Check that the output contains a substring, or all/any items in a list. Pass require_all=True to require every item to be present.
NotContains
Passes only when the output does NOT contain the given string or any item in a list. Essential for safety guardrails — catch refusal phrases, hallucinated names, or forbidden content.
RegexMatch
Validate that the output matches a regex pattern. Ideal for format checks: dates, phone numbers, JSON structure, email addresses, or custom codes.
Semantic [semantic]
SemanticSimilarity
Embeds both the output and expected answer using sentence-transformers and computes cosine similarity. Catches correct answers phrased differently. Configurable threshold (default 0.8).
RougeScore
Measures n-gram overlap between output and a reference text. The standard metric for summarization quality. Supports ROUGE-1, ROUGE-2, and ROUGE-L.
LLM Judge [openai] / [anthropic]
LLMJudge
Use any LLM to score the output against criteria you define in plain English. Pass your own judge_fn callable or let Rubric auto-detect from your API key environment variables.
GEval
Chain-of-thought evaluation: the LLM reasons step-by-step before assigning a score. More accurate than single-pass judging for nuanced criteria like coherence or factual accuracy.
HallucinationScore
Measures faithfulness of the output to a provided context. Two modes: LLM judge (explains which specific claims are hallucinated) or local NLI model via transformers (no API key). Requires test_case.context.
Agent Metrics Built-in
ToolCallAccuracy
Assert all expected tools were called, no forbidden tools were used, and optionally that they appeared in the correct order. Score degrades proportionally to missing or unexpected tools.
ToolCallEfficiency
Detects redundant calls (same tool + same args repeated), failed tool invocations, and slow individual tools. Combines into a single efficiency score.
TraceQuality
Analyzes the full reasoning trace for loops, repeated steps, and dead-end paths. Penalizes agents that get stuck cycling through the same actions without progress.
ReasoningQuality
Measures the ratio of reasoning to tool calls, and checks whether the agent updated its plan based on what it observed — a signal of genuine multi-step thinking.
SafetyCompliance
Scans outputs and tool arguments for PII (SSNs, credit cards, emails), dangerous SQL (DROP, DELETE, TRUNCATE), and forbidden tool names. Critical before production deployment.
ContextUtilization
For RAG: verifies the agent actually used retrieved context in its answer rather than hallucinating. Catches the failure mode of fetching documents then ignoring them entirely.
TaskCompletion
Determines whether the agent actually finished the task. Uses heuristic keyword checking by default, or an LLM judge when provided.
LatencyMetric · CostMetric
Enforce performance and cost budgets. Set a max latency in milliseconds or max cost in USD — scores degrade gracefully above the threshold.
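"Degrade gracefully" might look like a simple proportional fall-off past the budget. A sketch of the idea, not the shipped curve:

```python
def budget_score(actual: float, budget: float) -> float:
    """Full credit at or under budget, then a proportional fall-off
    to zero at twice the budget. Illustrative only."""
    if actual <= budget:
        return 1.0
    return max(0.0, 1.0 - (actual - budget) / budget)
```

The same shape works for latency in milliseconds or cost in USD, since both are scalar budgets.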

Start evaluating your LLM today.

Free, MIT licensed, and ready in minutes. No account, no cloud, no lock-in.