Leaderboards

Leaderboards are the main Dr.Gero workflow. They define a task, dataset, evaluation method, candidate models, ranking runs, and the production inference target.

Create Leaderboard wizard

Open Leaderboards → Create Leaderboard.

1. Name

The name is used for the leaderboard and the associated challenge.

2. System prompt

The prompt defines the task. It can include a placeholder:

```text
Answer the following task as clearly and concisely as possible.

Input:
{input}
```

Supported placeholders include {input}, {question}, and {query}. If no placeholder exists, Dr.Gero appends the input to the prompt.
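
The placeholder rule above can be sketched in a few lines of Python. This is an illustration only, not Dr.Gero's implementation: the function name `render_prompt` is hypothetical, and the blank-line separator used when the input is appended is an assumption, since the docs do not say how the input is joined to the prompt.

```python
# Placeholders named in the documentation.
PLACEHOLDERS = ("{input}", "{question}", "{query}")


def render_prompt(system_prompt: str, task_input: str) -> str:
    """Substitute the task input into the prompt (hypothetical helper)."""
    found = [p for p in PLACEHOLDERS if p in system_prompt]
    if not found:
        # Assumption: with no placeholder, the input is appended after a blank line.
        return f"{system_prompt.rstrip()}\n\n{task_input}"
    rendered = system_prompt
    for placeholder in found:
        # str.replace leaves any other braces in the prompt untouched.
        rendered = rendered.replace(placeholder, task_input)
    return rendered


if __name__ == "__main__":
    print(render_prompt("Answer concisely.\n\nInput:\n{input}", "What is 2 + 2?"))
```

Plain string replacement is used instead of `str.format` so that prompts containing unrelated braces, such as JSON examples, do not raise errors.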

3. Dataset

Choose one mode:

| Dataset mode | UI fields | Behavior |
| --- | --- | --- |
| Get dataset | Hugging Face URL | Dr.Gero reads a .jsonl or .jsonl.gz dataset. The UI validates the URL before continuing. |
| Push dataset | Max samples, daily/monthly limits, consolidation cadence, optional end date | Dr.Gero creates a webhook-style dataset that your app can append to. |
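
Before pointing Dr.Gero at a dataset, it can help to check locally that the file parses as JSON Lines. The sketch below reads a local `.jsonl` or `.jsonl.gz` file using only the Python standard library. The helper name `read_jsonl` is hypothetical, and the row fields Dr.Gero expects are not covered here, so the function simply yields whatever object each line contains.

```python
import gzip
import json
from typing import Iterator


def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a .jsonl or .jsonl.gz file."""
    # Pick the opener from the file extension; both accept text mode "rt".
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

A quick check such as `sum(1 for _ in read_jsonl("rows.jsonl.gz"))` reports the row count and raises `json.JSONDecodeError` on the first malformed line.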

A push dataset must accumulate enough accepted rows before you can add models. The UI guidance treats 100 accepted rows as the minimum for model onboarding.

4. Evaluation type

| Type | Description |
| --- | --- |
| Exact | Deterministic output comparison. |
| Judge | A judge provider/model scores candidate outputs. Defaults are based on OpenRouter. |
| Human | Human-reviewed evaluation workflow. |
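
To make the Exact type concrete, here is a minimal sketch of a deterministic comparison. It is not Dr.Gero's scorer: the function name `exact_match_score` is hypothetical, and trimming surrounding whitespace before comparing is an assumption about normalization that the docs do not specify.

```python
def exact_match_score(outputs: list[str], references: list[str]) -> float:
    """Return the fraction of outputs that equal their reference exactly."""
    if len(outputs) != len(references):
        raise ValueError("outputs and references must have the same length")
    if not references:
        return 0.0
    # Assumption: leading/trailing whitespace is ignored; everything else must match.
    hits = sum(out.strip() == ref.strip() for out, ref in zip(outputs, references))
    return hits / len(references)
```

Because the comparison is deterministic, repeated runs over the same outputs always produce the same score, which is what distinguishes this type from Judge and Human evaluation.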

Leaderboard detail view

After you select a leaderboard, the UI shows three tabs.

Ranking

The Ranking tab shows the current ranking table, selected production model, inference endpoint, and model-selection strategy.

Model selection can be:

  • Ranking winner: automatically use the current top-ranked model.
  • Manual: pin a chosen leaderboard model.
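
The two strategies can be summarized in a short sketch. The function name, the strategy strings `"ranking_winner"` and `"manual"`, and the representation of the ranking as an ordered list of model IDs are all assumptions made for illustration.

```python
from typing import Optional


def select_production_model(
    ranking: list[str], strategy: str, pinned: Optional[str] = None
) -> str:
    """Pick the production model from a ranking ordered best-first."""
    if not ranking:
        raise ValueError("ranking is empty")
    if strategy == "ranking_winner":
        # Follows the leaderboard: the choice changes when the top model changes.
        return ranking[0]
    if strategy == "manual":
        # Pinned: the choice stays fixed regardless of later ranking runs.
        if pinned not in ranking:
            raise ValueError("pinned model must be one of the leaderboard models")
        return pinned
    raise ValueError(f"unknown strategy: {strategy}")
```

For example, `select_production_model(["model-a", "model-b"], "manual", pinned="model-b")` returns `"model-b"` even though `"model-a"` ranks first.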

Detail

The Detail tab shows:

  • Candidate model list.
  • System prompt.
  • Dataset configuration.
  • Dataset path or push webhook metadata.
  • Evaluation config.
  • Traces URL.
  • Schedule JSON.

For PUSH leaderboards, the Detail tab also lets you create a dataset token and shows example webhook usage.

Run Logs

The Run Logs tab shows historical runs with:

  • Source: manual, schedule, or dataset improvement.
  • Started/finished time.
  • Execution time.
  • Total cost.
  • Ranking summary.
  • Leaderboard configuration JSON.
  • Model configuration JSON.

Add models

Click Add Model from the leaderboard detail view.

Auto-select models

Auto-select chooses OpenRouter models using constraints:

  • Number of models.
  • Optional input/output cost limits per million tokens.
  • Optional P95/P99 latency limits.
  • Optional open-source-only filter.
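
The constraints above behave like a filter over a model catalog. The sketch below shows one way such a filter could work. The `Candidate` fields, the function name, and the choice to keep catalog order among eligible models are assumptions; the docs do not describe how Dr.Gero orders or breaks ties between eligible models.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    """Hypothetical catalog entry; field names are illustrative only."""
    model_id: str
    input_cost_per_m: float   # cost per million input tokens
    output_cost_per_m: float  # cost per million output tokens
    p95_latency_s: float
    p99_latency_s: float
    open_source: bool


def auto_select(
    candidates: list[Candidate],
    count: int,
    max_input_cost: Optional[float] = None,
    max_output_cost: Optional[float] = None,
    max_p95: Optional[float] = None,
    max_p99: Optional[float] = None,
    open_source_only: bool = False,
) -> list[Candidate]:
    """Return up to `count` candidates that satisfy every constraint that is set."""

    def within(value: float, limit: Optional[float]) -> bool:
        # An unset limit (None) never excludes a candidate.
        return limit is None or value <= limit

    eligible = [
        c for c in candidates
        if within(c.input_cost_per_m, max_input_cost)
        and within(c.output_cost_per_m, max_output_cost)
        and within(c.p95_latency_s, max_p95)
        and within(c.p99_latency_s, max_p99)
        and (c.open_source or not open_source_only)
    ]
    # Assumption: keep the incoming order and truncate to the requested count.
    return eligible[:count]
```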

Manual model

Manual add supports:

| Platform | Required fields | Notes |
| --- | --- | --- |
| OpenRouter | Model ID | Uses the workspace OpenRouter integration. |
| Custom | API endpoint, optional auth method/token | Your endpoint should accept a POST request and return OpenAI-compatible chat-completions output or JSON/text. |
| Hugging Face | Endpoint URL and optional token | Available through the API; the UI may mark it unavailable depending on deployment. |
| Dr.Gero | Deployed Dr.Gero model | Lets you evaluate a model created in the Models area. |
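
When building a Custom endpoint, it can be useful to test its responses against the shapes listed in the table. The sketch below extracts text from an OpenAI-compatible chat-completions body (`choices[0].message.content`) and falls back to returning other JSON or plain text unchanged. The function name `extract_output` is hypothetical, and the fallback behavior is an assumption; the docs do not say how Dr.Gero interprets non-chat JSON.

```python
import json


def extract_output(body: str) -> str:
    """Pull the model's text out of a response body (illustrative only)."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return body  # not JSON: treat the body as plain text

    if isinstance(payload, dict):
        choices = payload.get("choices")
        if isinstance(choices, list) and choices and isinstance(choices[0], dict):
            message = choices[0].get("message")
            if isinstance(message, dict) and isinstance(message.get("content"), str):
                # OpenAI-compatible chat-completions shape.
                return message["content"]
    if isinstance(payload, str):
        return payload  # a bare JSON string
    # Assumption: any other JSON is passed through in serialized form.
    return json.dumps(payload)
```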

Run a leaderboard

Click Run after at least two models are attached. The app estimates runtime based on visible model count and dataset row count. While a run is active, the UI shows status and disables conflicting actions.
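
The docs do not give the formula behind the runtime estimate. As a rough mental model only, the sketch below assumes one model call per dataset row per model and a fixed average time per call. The function name and the `seconds_per_call` parameter are hypothetical, and the real estimate may account for concurrency or other factors.

```python
def estimate_runtime_seconds(
    model_count: int, row_count: int, seconds_per_call: float = 2.0
) -> float:
    """Rough sequential estimate: every model answers every row once."""
    if model_count < 2:
        # Mirrors the documented rule that Run needs at least two attached models.
        raise ValueError("attach at least two models before running")
    # Assumption: calls run one after another at a constant average duration.
    return model_count * row_count * seconds_per_call
```

Under these assumptions, three models over 100 rows at two seconds per call come to 3 × 100 × 2 = 600 seconds.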

Deleting leaderboards

Workspaces on paid plans can delete leaderboards. On the free plan, deletion may be locked, and each workspace is limited to a small number of leaderboards.