Leaderboards
Leaderboards are the main Dr.Gero workflow. They define a task, dataset, evaluation method, candidate models, ranking runs, and the production inference target.
Create Leaderboard wizard
Open Leaderboards → Create Leaderboard.
1. Name
The name is used for the leaderboard and the associated challenge.
2. System prompt
The prompt defines the task. It can include a placeholder:
```text
Answer the following task as clearly and concisely as possible.
Input:
{input}
```

Supported placeholders include {input}, {question}, and {query}. If no placeholder exists, Dr.Gero appends the input to the prompt.
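The placeholder rule can be illustrated with a minimal Python sketch. It is not Dr.Gero's implementation: the function name `render_prompt` is hypothetical, and the blank-line separator used when the input is appended is an assumption, since the source only says the input is appended.

```python
import re

# Placeholders named in the documentation.
PLACEHOLDER_PATTERN = re.compile(r"\{(?:input|question|query)\}")


def render_prompt(system_prompt: str, task_input: str) -> str:
    """Fill a supported placeholder, or append the input if none is present."""
    if PLACEHOLDER_PATTERN.search(system_prompt):
        # Single pass, so placeholder-like text inside task_input is left alone.
        return PLACEHOLDER_PATTERN.sub(lambda _match: task_input, system_prompt)
    # Assumption: the appended input is separated by a blank line.
    return f"{system_prompt}\n\n{task_input}"
```

For example, `render_prompt("Input:\n{input}", "2+2")` fills the placeholder, while a prompt without any placeholder gets the input added at the end.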
3. Dataset
Choose one mode:
| Dataset mode | UI fields | Behavior |
|---|---|---|
| Get dataset | Hugging Face URL | Dr.Gero reads a .jsonl or .jsonl.gz dataset. The UI validates the URL before continuing. |
| Push dataset | max samples, daily/monthly limits, consolidation cadence, optional end date | Dr.Gero creates a webhook-style dataset that your app can append to. |
A push dataset must collect enough accepted rows before you can add models. The UI guidance sets this minimum at 100 rows.
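To check a dataset locally before submitting its URL, something like the following sketch may help. It reads a `.jsonl` or `.jsonl.gz` file and compares a row count against the 100-row guidance for push datasets. The helper names (`parse_jsonl`, `read_dataset`, `ready_for_models`) are hypothetical, and the sketch makes no assumption about which fields Dr.Gero expects in each row.

```python
import gzip
import json
from typing import Iterable, Iterator

# Minimum accepted rows before model onboarding, per the UI guidance.
MIN_ROWS_FOR_MODELS = 100


def parse_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def read_dataset(path: str) -> list:
    """Read a local .jsonl or .jsonl.gz file into a list of rows."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        return list(parse_jsonl(handle))


def ready_for_models(accepted_rows: int, minimum: int = MIN_ROWS_FOR_MODELS) -> bool:
    """Return True once a push dataset has enough accepted rows."""
    return accepted_rows >= minimum
```

A malformed line raises `json.JSONDecodeError`, which is usually the quickest way to spot a broken export before the UI validation step.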
4. Evaluation type
| Type | Description |
|---|---|
| Exact | Deterministic output comparison. |
| Judge | A judge provider/model scores candidate outputs. Defaults are based on OpenRouter. |
| Human | Human-reviewed evaluation workflow. |
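As an illustration of the Exact type, here is a minimal sketch of a deterministic comparison. Whether Dr.Gero trims whitespace or normalizes case before comparing is not stated, so the `strip` option below is an assumption, and both function names are hypothetical.

```python
from typing import Iterable, Tuple


def exact_match(candidate: str, expected: str, strip: bool = True) -> bool:
    """Compare two outputs deterministically.

    Assumption: surrounding whitespace is ignored when strip is True.
    """
    if strip:
        candidate, expected = candidate.strip(), expected.strip()
    return candidate == expected


def exact_accuracy(pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (candidate, expected) pairs that match exactly."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(exact_match(c, e) for c, e in pairs) / len(pairs)
```

Exact comparison suits tasks with a single canonical answer, such as classification labels; free-form answers generally fit the Judge or Human types better.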
Leaderboard detail view
After you select a leaderboard, the UI shows three tabs.
Ranking
The Ranking tab shows the current ranking table, selected production model, inference endpoint, and model-selection strategy.
Model selection can be:
- Ranking winner: automatically use the current top-ranked model.
- Manual: pin a chosen leaderboard model.
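The two strategies can be sketched as follows. The strategy strings `"ranking_winner"` and `"manual"`, the function name, and the model IDs in the example are hypothetical labels chosen for this illustration, not values taken from the Dr.Gero API.

```python
from typing import Optional, Sequence


def select_production_model(
    ranking: Sequence[str],
    strategy: str,
    pinned: Optional[str] = None,
) -> str:
    """Pick the production model from a ranking ordered best-first."""
    if not ranking:
        raise ValueError("ranking is empty")
    if strategy == "ranking_winner":
        # Follows the current top-ranked model, so it can change after a run.
        return ranking[0]
    if strategy == "manual":
        # Stays on the pinned model regardless of rank.
        if pinned not in ranking:
            raise ValueError("pinned model is not on this leaderboard")
        return pinned
    raise ValueError(f"unknown strategy: {strategy}")
```

With `ranking = ["model-a", "model-b"]`, the ranking-winner strategy returns `"model-a"`, while manual selection with `pinned="model-b"` keeps `"model-b"` even though it ranks second.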
Detail
The Detail tab shows:
- Candidate model list.
- System prompt.
- Dataset configuration.
- Dataset path or push webhook metadata.
- Evaluation config.
- Traces URL.
- Schedule JSON.
For leaderboards that use a push dataset, the Detail tab also lets you create a dataset token and shows example webhook usage.
Run Logs
The Run Logs tab shows historical runs with:
- Source: manual, schedule, or dataset improvement.
- Started/finished time.
- Execution time.
- Total cost.
- Ranking summary.
- Leaderboard configuration JSON.
- Model configuration JSON.
Add models
Click Add Model from the leaderboard detail view.
Auto-select models
Auto-select chooses OpenRouter models using constraints:
- Number of models.
- Optional input/output cost limits per million tokens.
- Optional P95/P99 latency limits.
- Optional open-source-only filter.
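The constraints above amount to a filter over candidate models. The sketch below shows one way to express that filter; the `Candidate` fields, the latency unit (seconds), and the choice to keep the input order are all assumptions, because the source does not say how Dr.Gero orders or breaks ties between eligible models.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Candidate:
    # Hypothetical record; field names are not taken from OpenRouter or Dr.Gero.
    model_id: str
    input_cost_per_mtok: float
    output_cost_per_mtok: float
    p95_latency_s: float
    p99_latency_s: float
    open_source: bool


def _within(value: float, limit: Optional[float]) -> bool:
    """An unset limit (None) never excludes a model."""
    return limit is None or value <= limit


def auto_select(
    candidates: Sequence[Candidate],
    count: int,
    max_input_cost: Optional[float] = None,
    max_output_cost: Optional[float] = None,
    max_p95: Optional[float] = None,
    max_p99: Optional[float] = None,
    open_source_only: bool = False,
) -> List[Candidate]:
    """Return up to `count` candidates that satisfy every active constraint."""
    eligible = [
        c for c in candidates
        if _within(c.input_cost_per_mtok, max_input_cost)
        and _within(c.output_cost_per_mtok, max_output_cost)
        and _within(c.p95_latency_s, max_p95)
        and _within(c.p99_latency_s, max_p99)
        and (c.open_source or not open_source_only)
    ]
    # Assumption: input order is preserved; the real ordering is not documented.
    return eligible[:count]
```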
Manual model
Manual add supports:
| Platform | Required fields | Notes |
|---|---|---|
| OpenRouter | Model ID | Uses the workspace OpenRouter integration. |
| Custom | API endpoint, optional auth method/token | Your endpoint should accept a POST request and return OpenAI-compatible chat-completions output or JSON/text. |
| Hugging Face | Endpoint URL and optional token | Available through the API; the UI may mark it as unavailable, depending on the deployment. |
| Dr.Gero | Deployed Dr.Gero model | Lets you evaluate a model created in the Models area. |
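For the Custom platform, the following sketch shows how the three accepted response shapes (OpenAI-compatible chat completions, other JSON, or plain text) could be reduced to a single output string. The `choices[0].message.content` path follows the OpenAI chat-completions response format; how Dr.Gero treats other JSON bodies is not documented, so re-serializing them is an assumption, as is the function name.

```python
import json


def extract_output(body: str) -> str:
    """Reduce a custom endpoint's response body to one output string."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # Not JSON at all: treat the body as plain text.
        return body

    # OpenAI-compatible chat-completions shape.
    if isinstance(payload, dict):
        choices = payload.get("choices")
        if isinstance(choices, list) and choices and isinstance(choices[0], dict):
            message = choices[0].get("message")
            if isinstance(message, dict) and isinstance(message.get("content"), str):
                return message["content"]

    # A bare JSON string is returned without its quotes.
    if isinstance(payload, str):
        return payload

    # Assumption: any other JSON value is passed through in serialized form.
    return json.dumps(payload)
```

When building a custom endpoint, returning the chat-completions shape is the least ambiguous option, because the location of the answer is fixed.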
Run a leaderboard
Click Run after at least two models are attached. The app estimates runtime based on visible model count and dataset row count. While a run is active, the UI shows status and disables conflicting actions.
Deleting leaderboards
Workspaces on paid plans can delete leaderboards. The free plan may lock deletion and limit each workspace to a small number of leaderboards.