v0.1.1  ·  Open Source  ·  MIT License

pytest for AI.

Evaluate LLM outputs and AI agent behavior with a zero-dependency Python framework. Neutral, extensible, and not owned by any AI company.

Get Started → View on GitHub
$ pip install rubric-eval
Zero required dependencies
Native pytest integration
First-class agent trace eval
Works with any LLM
Local HTML reports
Import from LangFuse & LangSmith
Why Rubric

Built for developers.
Not for any AI company.

After Promptfoo joined OpenAI, the community needed a truly independent evaluation framework. Rubric is that framework — open source forever, no cloud required, no lock-in.

01

First-Class Agent Evaluation

Most eval frameworks only check the final output. Rubric evaluates the entire agent run — which tools were called, in what order, what the reasoning trace looks like, how long it took, and what it cost. You can require specific tools, forbid others, and penalize loops or redundant calls.

02

Works With Any LLM

Rubric is model-agnostic. Pass any Python callable as your judge function: OpenAI, Anthropic, Ollama, a local model, or a mock. No API keys required unless you use LLMJudge or GEval — Rubric auto-detects from your environment variables when you do.
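A judge is just a callable that takes text and returns a score. The exact signature Rubric expects is an assumption here; this mock shows the shape, and a real judge would wrap an OpenAI, Anthropic, or Ollama client in the same way:

```python
# Illustrative mock judge; swap the body for a real LLM client call.
# The (prompt: str) -> float shape is an assumption about Rubric's API.
def mock_judge(prompt: str) -> float:
    """Score 1.0 when the text under evaluation mentions Paris."""
    return 1.0 if "Paris" in prompt else 0.0
```

Because the judge is plain Python, the same test suite can swap between a cheap mock in CI and a real model locally.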

03

Zero Required Dependencies

The entire core — string matching, agent metrics, results, CLI, and HTML reports — ships with nothing mandatory. Install extras only when you need them: pip install "rubric-eval[semantic]" for embeddings (the quotes keep zsh from treating the brackets as a glob), [openai] or [anthropic] for LLM judging.

04

Native pytest Integration

Drop the rubric_eval fixture into any test file. Your LLM evals run inside the same pytest session as your unit tests — same CI pipeline, same runner, same output. No separate eval server or dashboard login needed.

05

Interactive HTML Reports

Every eval run can produce a self-contained HTML report — no server, no build step, just a file. Filter results by pass/fail, drill into per-metric score breakdowns with explanations, and see exact inputs and outputs side-by-side. Share it as a single file.

06

Truly Independent

MIT licensed. Not a product of OpenAI, Anthropic, Google, or any model provider. Rubric will never have a financial incentive to favor one model over another. Evaluate any model with the same unbiased framework — that's the whole point.

07

Zero-Friction Capture

Use rubric.capture() as a context manager or @rubric.track as a decorator to record every LLM call automatically — no manual TestCase construction needed. Captures input, output, latency, and context in one shot, then evaluate the whole session with a single call.

08

Import from LangFuse & LangSmith

Already using an observability platform? Import your existing traces directly into Rubric and run evals without touching your app. load_langfuse("traces.json") and load_langsmith("runs.json") convert exported traces into TestCase / AgentTestCase objects ready for evaluation.

What Makes Rubric Different

The metrics no one else ships.

These are the metrics that matter when you're deploying a real agent — not just a chatbot. Rubric ships them out of the box.

Tool Usage
Tool Call Accuracy & Efficiency
Verify not just what your agent did, but how well it did it. Two separate metrics cover correctness and quality of tool use.
ToolCallAccuracy — assert that every required tool was called, no forbidden tools were used, and optionally that they were called in the correct order. Score degrades proportionally to missing or unexpected tools.
ToolCallEfficiency — detect redundant calls (same tool, same args, called twice), failed tool invocations, and individual tools that exceeded a latency budget. Produces a composite efficiency score.
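The proportional-degradation idea behind ToolCallAccuracy can be sketched in a few lines. This is an illustration of the scoring intuition, not Rubric's actual formula:

```python
def tool_call_accuracy(called, required, forbidden=()):
    """Illustrative score: each missing required tool or forbidden
    call removes an equal share of the score. Not the shipped formula."""
    missing = [t for t in required if t not in called]
    illegal = [t for t in called if t in forbidden]
    violations = len(missing) + len(illegal)
    total = len(required) + len(illegal)
    return 1.0 if total == 0 else max(0.0, 1.0 - violations / total)
```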
Safety
Safety Compliance
Scan every agent output and tool call for real-world safety violations before they reach users.
PII detection — flags Social Security numbers, credit card patterns, email addresses, and phone numbers leaking through responses.
Dangerous SQL — catches DROP, DELETE, TRUNCATE, and other destructive patterns in tool arguments before they execute.
Forbidden tool enforcement — fails the test if any tool on your deny-list was invoked, regardless of what the agent was instructed to do.
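The kind of scan involved looks roughly like this. These regexes are deliberately simplified illustrations; the built-in SafetyCompliance checks are more thorough:

```python
import re

# Illustrative patterns only, not Rubric's implementation.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[A-Za-z]{2,}\b"),
}
DANGEROUS_SQL = re.compile(r"\b(DROP|DELETE|TRUNCATE)\b", re.IGNORECASE)

def scan_output(text):
    """Return the name of every violation found in an output string."""
    hits = [kind for kind, pat in PII_PATTERNS.items() if pat.search(text)]
    if DANGEROUS_SQL.search(text):
        hits.append("dangerous_sql")
    return hits
```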
Reasoning
Reasoning & Trace Quality
Evaluate the quality of the agent's thinking, not just its final answer. Two metrics cover different angles of the same problem.
TraceQuality — analyzes the full reasoning trace for circular loops, repeated steps, and dead-end paths. Penalizes agents that cycle through the same actions without making progress.
ReasoningQuality — measures the ratio of reasoning steps to tool calls, and checks whether the agent updated its plan based on observations — a sign of genuine multi-step thinking.
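The loop penalty can be approximated with a sliding-window check over the trace. A simplified sketch of the idea, not the actual TraceQuality implementation:

```python
def count_loops(trace, window=2):
    """Count positions where the most recent `window` steps exactly
    repeat the `window` steps before them. Illustrative only."""
    loops = 0
    for i in range(window, len(trace) - window + 1):
        if trace[i:i + window] == trace[i - window:i]:
            loops += 1
    return loops
```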
RAG
Context Utilization & Hallucination Detection
For RAG pipelines: verify the agent actually used what it retrieved, and catch claims that aren't grounded in the source context.
ContextUtilization — checks that the final answer actually draws on the retrieved context. Catches the common failure mode where an agent fetches documents but generates a response that doesn't reference them at all.
HallucinationScore — measures faithfulness of the output to the provided context. Two modes: LLM judge (returns a list of hallucinated claims) or local NLI model (no API key required). Score of 1.0 = fully grounded; 0.0 = fully hallucinated.
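The grounding intuition can be shown with a crude lexical version. The real metrics use embeddings, an LLM judge, or an NLI model, but the question they answer is the same:

```python
def context_overlap(answer: str, context: str) -> float:
    """Crude lexical grounding score: the fraction of answer tokens
    that also appear in the retrieved context. Illustration only."""
    answer_tokens = set(answer.lower().split())
    context_tokens = set(context.lower().split())
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & context_tokens) / len(answer_tokens)
```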
Quick Start

Up and running
in 5 minutes.

Install, define test cases, apply metrics, read the report.

eval.py
import rubriceval as rubric

# Replace with your real LLM call
def call_llm(prompt): ...

# Per-test metrics live on the TestCase.
# Shared metrics on evaluate() apply to all.
report = rubric.evaluate(
    test_cases=[
        rubric.TestCase(
            name="Pricing inquiry",
            input="What are the pricing plans?",
            actual_output=call_llm("What are the pricing plans?"),
            metrics=[rubric.Contains(["$29", "$99", "trial"])],
        ),
        rubric.TestCase(
            name="Cancellation flow",
            input="How do I cancel my subscription?",
            actual_output=call_llm("How do I cancel my subscription?"),
            metrics=[rubric.Contains(["Settings", "Billing", "export"])],
        ),
    ],
    metrics=[rubric.NotContains(["I don't know", "I'm not sure"])],
    output_html="report.html",
)
test_llm.py — pytest
import rubriceval as rubric

def test_agent_books_flight(rubric_eval):
    rubric_eval.add_case(
        rubric.AgentTestCase(
            input="Book a flight to Tokyo",
            actual_output=agent.run("Book a flight to Tokyo"),
            expected_tools=["search_flights", "book_flight"],
            tool_calls=agent.tool_calls,
        ),
        metrics=[
            rubric.ToolCallAccuracy(),
            rubric.LatencyMetric(max_ms=5000),
        ],
    )
    # auto-asserts on teardown
capture.py — zero-friction recording
import rubriceval as rubric

# Option A: context manager — explicit recording
with rubric.capture() as session:
    answer = my_llm("Where is the Eiffel Tower?")
    session.record(
        input="Where is the Eiffel Tower?",
        actual_output=answer,
        context="The Eiffel Tower is in Paris, built in 1889.",
    )

report = session.evaluate(metrics=[
    rubric.HallucinationScore(judge_fn=my_judge),
])

# Option B: decorator — capture every call automatically
@rubric.track
def ask(prompt, context=None):
    return my_llm(prompt)

ask("Who wrote Hamlet?")
ask("What is the capital of Egypt?")

report = rubric.get_session().evaluate(
    metrics=[rubric.Contains("Cairo")]
)
rubric.reset_session()
importers.py — LangFuse & LangSmith
from rubriceval.integrations.loaders import (
    load_langfuse, load_langsmith
)
import rubriceval as rubric

# Export traces from LangFuse UI → Traces → Export JSON
test_cases = load_langfuse("langfuse_export.json")

# Or from LangSmith: Project → Runs → Export JSON
test_cases = load_langsmith("langsmith_runs.json")

# Run evals on your existing production traces
report = rubric.evaluate(
    test_cases=test_cases,
    metrics=[
        rubric.HallucinationScore(judge_fn=my_judge),
        rubric.ToolCallAccuracy(),
        rubric.LatencyMetric(max_ms=5000),
    ],
    output_html="trace_report.html",
)
report.print_summary()
agent_eval.py — full agent eval
import rubriceval as rubric

# Pass what your agent actually did — tool calls, trace, latency
results = rubric.evaluate(
    test_cases=[
        rubric.AgentTestCase(
            name="Order inquiry",
            input="Where is my order #ORD-9821?",
            actual_output=agent.run("Where is my order #ORD-9821?"),
            expected_tools=["lookup_order", "create_ticket"],
            tool_calls=agent.tool_calls,
            trace=agent.trace,
            latency_ms=agent.latency_ms,
        ),
        rubric.AgentTestCase(
            name="Urgent — account locked",
            input="My account is locked, this is urgent.",
            actual_output=agent.run("My account is locked, this is urgent."),
            expected_tools=["create_ticket"],
            forbidden_tools=["send_email"],
            tool_calls=agent.tool_calls,
            trace=agent.trace,
            latency_ms=agent.latency_ms,
        ),
    ],
    metrics=[
        rubric.ToolCallAccuracy(check_order=False),
        rubric.TraceQuality(penalize_loops=True),
        rubric.TaskCompletion(),
        rubric.LatencyMetric(max_ms=3000),
    ],
    output_html="report.html",
)
Output

Every run produces a report
your team can actually use.

A single self-contained HTML file. No server. No login. Filter by pass/fail, drill into agent traces, inspect tool calls.

agent_report.html
Metrics Library

18 metrics across 4 categories.

Mix and match. Extend with your own by subclassing BaseMetric.
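A custom metric can be as small as a class with a measure method. The interface below is a guess at the shape — the real BaseMetric API may differ — with a local stand-in so the sketch runs on its own:

```python
# Stand-in for rubric's BaseMetric; treat the interface as an assumption.
class BaseMetric:
    name = "base"

    def measure(self, output: str) -> float:
        raise NotImplementedError

class WordBudget(BaseMetric):
    """Full credit under a word budget, proportional fall-off above it."""
    name = "word_budget"

    def __init__(self, max_words: int = 50):
        self.max_words = max_words

    def measure(self, output: str) -> float:
        n = len(output.split())
        return 1.0 if n <= self.max_words else self.max_words / n
```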

String Matching No deps
ExactMatch
Binary exact string comparison with optional case-insensitive mode. Returns 1.0 or 0.0. Best for structured outputs with a known correct answer.
Contains
Check that the output contains a substring, or all/any items in a list. Pass require_all=True to require every item to be present.
NotContains
Passes only when the output does NOT contain the given string or any item in a list. Essential for safety guardrails — catch refusal phrases, hallucinated names, or forbidden content.
RegexMatch
Validate that the output matches a regex pattern. Ideal for format checks: dates, phone numbers, JSON structure, email addresses, or custom codes.
Semantic [semantic]
SemanticSimilarity
Embeds both the output and expected answer using sentence-transformers and computes cosine similarity. Catches correct answers phrased differently. Configurable threshold (default 0.8).
RougeScore
Measures n-gram overlap between output and a reference text. The standard metric for summarization quality. Supports ROUGE-1, ROUGE-2, and ROUGE-L.
LLM Judge [openai] / [anthropic]
LLMJudge
Use any LLM to score the output against criteria you define in plain English. Pass your own judge_fn callable or let Rubric auto-detect from your API key environment variables.
GEval
Chain-of-thought evaluation: the LLM reasons step-by-step before assigning a score. More accurate than single-pass judging for nuanced criteria like coherence or factual accuracy.
HallucinationScore
Measures faithfulness of the output to a provided context. Two modes: LLM judge (explains which specific claims are hallucinated) or local NLI model via transformers (no API key). Requires test_case.context.
Agent Metrics Built-in
ToolCallAccuracy
Assert all expected tools were called, no forbidden tools were used, and optionally that they appeared in the correct order. Score degrades proportionally to missing or unexpected tools.
ToolCallEfficiency
Detects redundant calls (same tool + same args repeated), failed tool invocations, and slow individual tools. Combines into a single efficiency score.
TraceQuality
Analyzes the full reasoning trace for loops, repeated steps, and dead-end paths. Penalizes agents that get stuck cycling through the same actions without progress.
ReasoningQuality
Measures the ratio of reasoning to tool calls, and checks whether the agent updated its plan based on what it observed — a signal of genuine multi-step thinking.
SafetyCompliance
Scans outputs and tool arguments for PII (SSNs, credit cards, emails), dangerous SQL (DROP, DELETE, TRUNCATE), and forbidden tool names. Critical before production deployment.
ContextUtilization
For RAG: verifies the agent actually used retrieved context in its answer rather than hallucinating. Catches the failure mode of fetching documents then ignoring them entirely.
TaskCompletion
Determines whether the agent actually finished the task. Uses heuristic keyword checking by default, or an LLM judge when provided.
LatencyMetric · CostMetric
Enforce performance and cost budgets. Set a max latency in milliseconds or max cost in USD — scores degrade gracefully above the threshold.
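"Degrade gracefully" might look like a simple proportional fall-off past the budget. A sketch of the idea, not the shipped curve:

```python
def budget_score(actual: float, budget: float) -> float:
    """Full credit at or under budget, then a proportional fall-off
    to zero at twice the budget. Illustrative only."""
    if actual <= budget:
        return 1.0
    return max(0.0, 1.0 - (actual - budget) / budget)
```

The same shape works for latency in milliseconds or cost in USD, since both are scalar budgets.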

Start evaluating your LLM today.

Free, MIT licensed, and ready in minutes. No account, no cloud, no lock-in.