Evaluate LLM outputs and AI agent behavior with a zero-dependency Python framework. Neutral, extensible, and not owned by any AI company.
After Promptfoo joined OpenAI, the community needed a truly independent evaluation framework. Rubric is that answer — open source forever, no cloud required, no lock-in.
Most eval frameworks only check the final output. Rubric evaluates the entire agent run — which tools were called, in what order, what the reasoning trace looks like, how long it took, and what it cost. You can require specific tools, forbid others, and penalize loops or redundant calls.
Rubric is model-agnostic. Pass any Python callable as your judge function: OpenAI, Anthropic, Ollama, a local model, or a mock. No API keys required unless you use LLMJudge or GEval — Rubric auto-detects from your environment variables when you do.
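Because the judge is just a callable, swapping backends is a one-line change. A minimal sketch, assuming a `judge_fn(prompt) -> str` contract (the docs here only show `judge_fn=my_judge`, so the exact signature is an assumption): `mock_judge` needs no network or keys, and `ollama_judge` assumes a local Ollama server on its default port.

```python
import json
import urllib.request

def mock_judge(prompt: str) -> str:
    """Deterministic stand-in for CI runs: no network, no API keys."""
    return "PASS" if "Paris" in prompt else "FAIL"

def ollama_judge(prompt: str) -> str:
    """Local model via Ollama's HTTP generate API (assumes the default port)."""
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(
            {"model": "llama3", "prompt": prompt, "stream": False}
        ).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Either callable can be passed wherever the examples show `judge_fn=my_judge`; the mock keeps eval runs deterministic in CI.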
The entire core — string matching, agent metrics, results, the CLI, and HTML reports — ships with zero required dependencies. Install extras only when you need them: pip install "rubric-eval[semantic]" for embedding-based metrics, "rubric-eval[openai]" or "rubric-eval[anthropic]" for LLM judging.
Drop the rubric_eval fixture into any test file. Your LLM evals run inside the same pytest session as your unit tests — same CI pipeline, same runner, same output. No separate eval server or dashboard login needed.
Every eval run can produce a self-contained HTML report — no server, no build step, just a file. Filter results by pass/fail, drill into per-metric score breakdowns with explanations, and see exact inputs and outputs side-by-side. Share it as a single file.
MIT licensed. Not a product of OpenAI, Anthropic, Google, or any model provider. Rubric will never have a financial incentive to favor one model over another. Evaluate any model with the same unbiased framework — that's the whole point.
Use rubric.capture() as a context manager or @rubric.track as a decorator to record every LLM call automatically — no manual TestCase construction needed. Captures input, output, latency, and context in one shot, then evaluate the whole session with a single call.
Already using an observability platform? Import your existing traces directly into Rubric and run evals without touching your app. load_langfuse("traces.json") and load_langsmith("runs.json") convert exported traces into TestCase / AgentTestCase objects ready for evaluation.
These are the metrics that matter when you're deploying a real agent — not just a chatbot. Rubric ships them out of the box.
Install, define test cases, apply metrics, read the report.
import rubriceval as rubric

# Replace with your real LLM call
def call_llm(prompt): ...

# Per-test metrics live on the TestCase.
# Shared metrics on evaluate() apply to all.
report = rubric.evaluate(
    test_cases=[
        rubric.TestCase(
            name="Pricing inquiry",
            input="What are the pricing plans?",
            actual_output=call_llm("What are the pricing plans?"),
            metrics=[rubric.Contains(["$29", "$99", "trial"])],
        ),
        rubric.TestCase(
            name="Cancellation flow",
            input="How do I cancel my subscription?",
            actual_output=call_llm("How do I cancel my subscription?"),
            metrics=[rubric.Contains(["Settings", "Billing", "export"])],
        ),
    ],
    metrics=[rubric.NotContains(["I don't know", "I'm not sure"])],
    output_html="report.html",
)
import rubriceval as rubric

def test_agent_books_flight(rubric_eval):
    rubric_eval.add_case(
        rubric.AgentTestCase(
            input="Book a flight to Tokyo",
            actual_output=agent.run("Book a flight to Tokyo"),
            expected_tools=["search_flights", "book_flight"],
            tool_calls=agent.tool_calls,
        ),
        metrics=[
            rubric.ToolCallAccuracy(),
            rubric.LatencyMetric(max_ms=5000),
        ],
    )
    # auto-asserts on teardown
import rubriceval as rubric

# Option A: context manager — explicit recording
with rubric.capture() as session:
    answer = my_llm("Where is the Eiffel Tower?")
    session.record(
        input="Where is the Eiffel Tower?",
        actual_output=answer,
        context="The Eiffel Tower is in Paris, built in 1889.",
    )

report = session.evaluate(metrics=[
    rubric.HallucinationScore(judge_fn=my_judge),
])

# Option B: decorator — capture every call automatically
@rubric.track
def ask(prompt, context=None):
    return my_llm(prompt)

ask("Who wrote Hamlet?")
ask("What is the capital of Egypt?")

report = rubric.get_session().evaluate(
    metrics=[rubric.Contains("Cairo")]
)
rubric.reset_session()
from rubriceval.integrations.loaders import (
    load_langfuse,
    load_langsmith,
)
import rubriceval as rubric

# Export traces from LangFuse UI → Traces → Export JSON
test_cases = load_langfuse("langfuse_export.json")

# Or from LangSmith: Project → Runs → Export JSON
test_cases = load_langsmith("langsmith_runs.json")

# Run evals on your existing production traces
report = rubric.evaluate(
    test_cases=test_cases,
    metrics=[
        rubric.HallucinationScore(judge_fn=my_judge),
        rubric.ToolCallAccuracy(),
        rubric.LatencyMetric(max_ms=5000),
    ],
    output_html="trace_report.html",
)
report.print_summary()
import rubriceval as rubric

# Pass what your agent actually did — tool calls, trace, latency
results = rubric.evaluate(
    test_cases=[
        rubric.AgentTestCase(
            name="Order inquiry",
            input="Where is my order #ORD-9821?",
            actual_output=agent.run("Where is my order #ORD-9821?"),
            expected_tools=["lookup_order", "create_ticket"],
            tool_calls=agent.tool_calls,
            trace=agent.trace,
            latency_ms=agent.latency_ms,
        ),
        rubric.AgentTestCase(
            name="Urgent — account locked",
            input="My account is locked, this is urgent.",
            actual_output=agent.run("My account is locked, this is urgent."),
            expected_tools=["create_ticket"],
            forbidden_tools=["send_email"],
            tool_calls=agent.tool_calls,
            trace=agent.trace,
            latency_ms=agent.latency_ms,
        ),
    ],
    metrics=[
        rubric.ToolCallAccuracy(check_order=False),
        rubric.TraceQuality(penalize_loops=True),
        rubric.TaskCompletion(),
        rubric.LatencyMetric(max_ms=3000),
    ],
    output_html="report.html",
)
A single self-contained HTML file. No server. No login. Filter by pass/fail, drill into agent traces, inspect tool calls.
Mix and match. Extend with your own by subclassing BaseMetric.
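To show what a custom metric might look like, here is a hypothetical sketch. The real BaseMetric interface isn't documented in this section, so the `measure` method name and the dict result shape are assumptions; in real use you would subclass rubric.BaseMetric and match its actual contract.

```python
from types import SimpleNamespace

class WordCountMetric:
    """Hypothetical custom metric: fail any output longer than max_words.

    Assumed interface -- adapt the method name and return shape to
    whatever rubric.BaseMetric actually requires.
    """

    def __init__(self, max_words=50):
        self.max_words = max_words

    def measure(self, test_case):
        words = len(test_case.actual_output.split())
        return {
            "score": min(1.0, self.max_words / max(words, 1)),
            "passed": words <= self.max_words,
            "reason": f"{words} words (limit {self.max_words})",
        }

# Stand-in for a TestCase so the sketch runs without the library installed
case = SimpleNamespace(actual_output="Your order shipped yesterday.")
print(WordCountMetric(max_words=10).measure(case))
```

A length cap like this is handy for agents that write UI copy, where a verbose but technically correct answer should still fail.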
Pass require_all=True to enforce that every item must be present.
Pass a judge_fn callable or let Rubric auto-detect from your API key environment variables.
Runs locally via transformers (no API key). Requires test_case.context.