Core concepts

Workspace

A workspace represents the business account. Admin-only pages include Settings, Tokens, Integrations, and Billing. Owners and admins can invite users and manage workspace access.

Integrations

Integrations store provider credentials for the workspace:

OpenRouter is required for leaderboard creation, model auto-selection, judge evaluation, and OpenRouter model calls.
Hugging Face is recommended for private or gated datasets.

Integration validation happens through /api/integrations/validate using the signed-in user's Supabase session.

API token

A Dr.Gero API token is a server-side credential that starts with drgero_. It can have scopes, an optional expiration, and an optional dollar budget. Runtime endpoints accept it as:

```http
Authorization: Bearer drgero_...
```

or, where CORS allows it:

```http
X-API-Key: drgero_...
X-Dr.Gero-API-Key: drgero_...
```
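
As a rough illustration of the header formats above, the following Python sketch builds the request headers for a token. The helper name `auth_headers` and its `use_bearer` flag are hypothetical and not part of any Dr.Gero SDK; only the `drgero_` prefix and the header names come from this page.

```python
from typing import Dict

TOKEN_PREFIX = "drgero_"  # prefix documented for Dr.Gero API tokens


def auth_headers(token: str, use_bearer: bool = True) -> Dict[str, str]:
    """Build request headers carrying a Dr.Gero API token (hypothetical helper)."""
    # Catch obviously wrong credentials before any request is sent.
    if not token.startswith(TOKEN_PREFIX):
        raise ValueError("expected a token starting with 'drgero_'")
    if use_bearer:
        return {"Authorization": f"Bearer {token}"}
    # Alternative header form, usable where CORS allows it.
    return {"X-API-Key": token}
```

Because the token is a server-side credential, a caller would typically load it from an environment variable or secret store rather than embedding it in client code.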

Leaderboard

A leaderboard combines a task prompt, dataset, candidate models, evaluation configuration, run history, and the currently selected production model. The selected model can be the ranking winner or a manual override.
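
The relationship between the ranking winner and a manual override can be sketched as follows. This is only an illustration under stated assumptions: the `RankingRow` type, its fields, and the rule that a higher score wins are hypothetical, and the sketch assumes a manual override takes precedence over the ranking.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RankingRow:
    """Hypothetical ranking row: one candidate model and its score."""
    model_id: str
    score: float


def select_production_model(
    ranking: List[RankingRow], manual_override: Optional[str] = None
) -> Optional[str]:
    """Return the model that would serve production traffic."""
    # Assumption: an explicit override wins over the ranking.
    if manual_override is not None:
        return manual_override
    if not ranking:
        return None
    # Assumption: the ranking winner is the row with the highest score.
    return max(ranking, key=lambda row: row.score).model_id
```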

Challenge prompt

The challenge prompt is the system/task prompt applied to dataset examples and inference requests. If the prompt contains {input}, {question}, or {query}, Dr.Gero replaces the placeholder with the user input. Otherwise, it appends the user input to the end of the prompt.
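
The placeholder behavior can be approximated with a short Python sketch. The function name `render_prompt` is hypothetical, and two details are assumptions rather than documented behavior: every occurrence of a placeholder is replaced, and appended input is separated from the prompt by a blank line.

```python
import re

# The three placeholders named in this section.
_PLACEHOLDER = re.compile(r"\{(?:input|question|query)\}")


def render_prompt(challenge_prompt: str, user_input: str) -> str:
    """Fill a challenge prompt with user input (illustrative sketch)."""
    if _PLACEHOLDER.search(challenge_prompt):
        # A callable replacement keeps backslashes and braces in the input literal.
        return _PLACEHOLDER.sub(lambda _match: user_input, challenge_prompt)
    # No placeholder found: append the input (the separator is an assumption).
    return f"{challenge_prompt}\n\n{user_input}"
```

For example, `render_prompt("Answer briefly: {question}", "What is 2 + 2?")` fills the placeholder, while a prompt without any placeholder gets the input appended after it.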

Dataset modes

Mode	Description	Typical use
GET	Read a `.jsonl` or `.jsonl.gz` dataset from Hugging Face.	Static benchmark or curated eval set.
PUSH	Accept examples through a webhook and periodically consolidate them into a JSONL dataset.	Production feedback loops and trace collection.
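
To make the dataset file format concrete, here is a minimal Python sketch that reads either a `.jsonl` or a `.jsonl.gz` file from local disk, one JSON object per line. The function name `read_jsonl` is hypothetical, and downloading the file from Hugging Face is out of scope here.

```python
import gzip
import json
from pathlib import Path
from typing import Iterator, Union


def read_jsonl(path: Union[str, Path]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of the file."""
    path = Path(path)
    # Pick the opener from the file extension; both support text mode.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Reading lazily with a generator keeps memory use flat for large datasets, which matters when PUSH mode consolidates many examples into one file.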

Evaluation types

Type	Description
Exact match	Compare model output with expected output exactly or through deterministic matching.
Judge	Use a judge model, usually via OpenRouter, to score outputs against a rubric or expected answer.
Human	Track manually reviewed results.
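
The exact-match type can be illustrated with a small Python sketch. The normalization shown (case folding and whitespace collapsing) is only one possible form of deterministic matching and is an assumption, not Dr.Gero's documented rule. The function names are hypothetical.

```python
from typing import Iterable, Tuple


def normalize(text: str) -> str:
    """Case-fold and collapse whitespace so trivial differences do not count."""
    return " ".join(text.casefold().split())


def exact_match(output: str, expected: str, strict: bool = False) -> bool:
    """Compare model output with the expected output."""
    if strict:
        return output == expected  # byte-for-byte comparison
    return normalize(output) == normalize(expected)


def accuracy(pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (output, expected) pairs that match after normalization."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(exact_match(out, exp) for out, exp in pairs) / len(pairs)
```

Tasks with free-form answers usually need the Judge type instead, since no deterministic rule captures paraphrases.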

Candidate model

A candidate model is a model endpoint attached to a leaderboard. It may be OpenRouter, Custom, Hugging Face, or a Dr.Gero model. Leaderboard runs evaluate candidate models and write ranking rows.

Run

A run evaluates selected candidate models against the leaderboard dataset. Runs can be manual, scheduled, or triggered by dataset improvement workflows. Each run stores model configs, leaderboard config, cost, timing, and ranking output.

Trace

A trace is a JSON record of a run, inference call, dataset event, or manual event. Traces power debugging, auditing, and dataset improvement.

Dr.Gero model

A Dr.Gero model is a workspace model object that can be assigned to leaderboards and fine-tuned. Fine-tune runs can use leaderboard datasets and support schedules.
