Leaderboards

Leaderboards are the main Dr.Gero workflow. They define a task, dataset, evaluation method, candidate models, ranking runs, and the production inference target.

Create Leaderboard wizard

Open Leaderboards → Create Leaderboard.

1. Name

The name is used for the leaderboard and the associated challenge.

2. System prompt

The prompt defines the task. It can include a placeholder:

Answer the following task as clearly and concisely as possible.

Input:
{input}

Supported placeholders include {input}, {question}, and {query}. If the prompt contains no placeholder, Dr.Gero appends the input to the prompt.
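
The placeholder rule above can be illustrated with a minimal sketch. The function name `render_prompt` and the blank-line separator used when appending are assumptions for illustration; this page does not specify how Dr.Gero formats the appended input.

```python
import re

# The three placeholder names listed in this section.
PLACEHOLDER_PATTERN = re.compile(r"\{(input|question|query)\}")


def render_prompt(system_prompt: str, row_input: str) -> str:
    """Fill a supported placeholder, or append the input if none is present."""
    if PLACEHOLDER_PATTERN.search(system_prompt):
        # A callable replacement inserts row_input literally, in a single pass,
        # so braces or backslashes inside the input are left untouched.
        return PLACEHOLDER_PATTERN.sub(lambda _match: row_input, system_prompt)
    # Assumption: the separator between prompt and input is a blank line.
    return f"{system_prompt}\n\n{row_input}"


if __name__ == "__main__":
    print(render_prompt("Summarize the text.\n\nInput:\n{input}", "Cats sleep a lot."))
    print(render_prompt("Summarize the text.", "Cats sleep a lot."))
```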

3. Dataset

Choose one mode:

Dataset mode	UI fields	Behavior
Get dataset	Hugging Face URL	Dr.Gero reads a `.jsonl` or `.jsonl.gz` dataset. The UI validates the URL before continuing.
Push dataset	max samples, daily/monthly limits, consolidation cadence, optional end date	Dr.Gero creates a webhook-style dataset that your app can append to.

A push dataset must accumulate enough accepted rows before you can add models. The UI guidance sets the minimum at 100 rows before model onboarding.
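
Before pointing Dr.Gero at a dataset file, it can help to check locally that the file parses as JSON Lines. The sketch below reads a `.jsonl` or `.jsonl.gz` file with the Python standard library. The helper name `read_jsonl_rows` is hypothetical, and because this page does not describe the expected row fields, the check only confirms that each non-empty line is valid JSON.

```python
import gzip
import json
from pathlib import Path


def read_jsonl_rows(path):
    """Return the parsed rows of a .jsonl or .jsonl.gz file."""
    path = Path(path)
    # Pick the gzip reader for compressed files, the plain reader otherwise.
    opener = gzip.open if path.name.endswith(".gz") else open
    rows = []
    with opener(path, "rt", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                rows.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(
                    f"{path.name}: invalid JSON on line {line_number}"
                ) from exc
    return rows
```

Calling `len(read_jsonl_rows("data.jsonl.gz"))` on a local copy gives a quick row count to compare against the limits you plan to configure.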

4. Evaluation type

Type	Description
Exact	Deterministic output comparison.
Judge	A judge provider/model scores candidate outputs. Defaults are based on OpenRouter.
Human	Human-reviewed evaluation workflow.
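
As a rough illustration of what deterministic output comparison means for the Exact type, the sketch below scores a model by the fraction of outputs that equal the expected value. The function names are hypothetical, and trimming surrounding whitespace is an assumption; this page does not say whether Dr.Gero normalizes outputs before comparing them.

```python
def exact_match(expected: str, actual: str, strip_whitespace: bool = True) -> bool:
    """Compare two outputs for equality."""
    if strip_whitespace:
        # Assumption: leading and trailing whitespace is ignored.
        expected, actual = expected.strip(), actual.strip()
    return expected == actual


def exact_accuracy(pairs) -> float:
    """Fraction of (expected, actual) pairs that match exactly."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    matches = sum(exact_match(expected, actual) for expected, actual in pairs)
    return matches / len(pairs)
```

The same input always produces the same score, which is what separates this type from judge-based or human evaluation.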

Leaderboard detail view

After you select a leaderboard, the UI shows three tabs.

Ranking

The Ranking tab shows the current ranking table, selected production model, inference endpoint, and model-selection strategy.

Model selection can be:

Ranking winner: automatically use the current top-ranked model.
Manual: pin a chosen leaderboard model.
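
The two strategies can be pictured as a small resolution step that decides which model serves production inference. This is a sketch only: the strategy strings `"ranking_winner"` and `"manual"` and the function name are hypothetical labels, not Dr.Gero API values.

```python
from typing import Optional, Sequence


def resolve_production_model(
    strategy: str,
    ranking: Sequence[str],
    pinned_model: Optional[str] = None,
) -> str:
    """Pick the production model from a ranking ordered best-first."""
    if strategy == "ranking_winner":
        if not ranking:
            raise ValueError("the leaderboard has no ranked models yet")
        return ranking[0]  # follows the current top-ranked model
    if strategy == "manual":
        if pinned_model is None:
            raise ValueError("manual selection needs a pinned model")
        if pinned_model not in ranking:
            raise ValueError("the pinned model must belong to the leaderboard")
        return pinned_model  # stays fixed even if the ranking changes
    raise ValueError(f"unknown strategy: {strategy!r}")
```

With Ranking winner, the production target can change after any run that reorders the table; with Manual, it changes only when you pin a different model.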

Detail

The Detail tab shows:

Candidate model list.
System prompt.
Dataset configuration.
Dataset path or push webhook metadata.
Evaluation config.
Traces URL.
Schedule JSON.

For PUSH leaderboards, the Detail tab also lets you create a dataset token and shows example webhook usage.

Run Logs

The Run Logs tab shows historical runs with:

Source: manual, schedule, or dataset improvement.
Started/finished time.
Execution time.
Total cost.
Ranking summary.
Leaderboard configuration JSON.
Model configuration JSON.

Add models

Click Add Model from the leaderboard detail view.

Auto-select models

Auto-select chooses OpenRouter models using constraints:

Number of models.
Optional input/output cost limits per million tokens.
Optional P95/P99 latency limits.
Optional open-source-only filter.
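
One way to picture how these constraints combine is as a filter over a model catalog followed by a cut to the requested count. The sketch below is illustrative only: the `ModelInfo` fields, the units, and the rule of keeping the first eligible models in catalog order are all assumptions, since this page does not describe how Dr.Gero ranks eligible models.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ModelInfo:
    model_id: str
    input_cost_per_mtok: float   # cost per million input tokens
    output_cost_per_mtok: float  # cost per million output tokens
    p95_latency_s: float
    p99_latency_s: float
    open_source: bool


def _within(value: float, limit: Optional[float]) -> bool:
    """An unset limit (None) never excludes a model."""
    return limit is None or value <= limit


def auto_select(
    candidates: List[ModelInfo],
    count: int,
    max_input_cost: Optional[float] = None,
    max_output_cost: Optional[float] = None,
    max_p95_s: Optional[float] = None,
    max_p99_s: Optional[float] = None,
    open_source_only: bool = False,
) -> List[ModelInfo]:
    """Keep models that satisfy every set constraint, then take `count`."""
    eligible = [
        model
        for model in candidates
        if _within(model.input_cost_per_mtok, max_input_cost)
        and _within(model.output_cost_per_mtok, max_output_cost)
        and _within(model.p95_latency_s, max_p95_s)
        and _within(model.p99_latency_s, max_p99_s)
        and (model.open_source or not open_source_only)
    ]
    # Assumption: catalog order decides which eligible models are kept.
    return eligible[:count]
```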

Manual model

Manual add supports:

Platform	Required fields	Notes
OpenRouter	Model ID	Uses the workspace OpenRouter integration.
Custom	API endpoint, optional auth method/token	Your endpoint should accept a POST request and return OpenAI-compatible chat-completions output or JSON/text.
Hugging Face	Endpoint URL and optional token	Available through the API; the UI may mark it unavailable depending on deployment.
Dr.Gero	Deployed Dr.Gero model	Lets you evaluate a model created in the Models area.
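
For a Custom endpoint, the response shapes named in the table can be handled along the following lines. This is a sketch of one plausible reading, not Dr.Gero's actual parser: the function name is hypothetical, and returning the raw body for JSON that is not in chat-completions form is an assumption. The `choices[0].message.content` path is the standard location of the reply text in an OpenAI-compatible chat-completions response.

```python
import json


def extract_output_text(body: str) -> str:
    """Pull the model's answer out of a custom endpoint response body."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return body  # plain-text response

    if isinstance(payload, dict):
        choices = payload.get("choices")
        if isinstance(choices, list) and choices and isinstance(choices[0], dict):
            message = choices[0].get("message")
            if isinstance(message, dict) and isinstance(message.get("content"), str):
                return message["content"]  # OpenAI-compatible shape
    if isinstance(payload, str):
        return payload  # a bare JSON string
    # Assumption: any other JSON is passed through unchanged.
    return body
```

If you control the endpoint, returning the chat-completions shape is the least ambiguous option, because the answer text has a single well-known location.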

Run a leaderboard

Click Run once at least two models are attached. The app estimates the runtime from the number of visible models and the number of dataset rows. While a run is active, the UI shows its status and disables conflicting actions.

Deleting leaderboards

Paid plans can delete leaderboards. The free plan may lock deletion and limit workspaces to a small number of leaderboards.
