Core concepts
Workspace
A workspace represents the business account. Admin-only pages include Settings, Tokens, Integrations, and Billing. Owners and admins can invite users and manage workspace access.
Integrations
Integrations store provider credentials for the workspace:
- OpenRouter is required for leaderboard creation, model auto-selection, judge evaluation, and OpenRouter model calls.
- Hugging Face is recommended for private or gated datasets.
Integration validation happens through /api/integrations/validate using the signed-in user's Supabase session.
API token
A Dr.Gero API token is a server-side credential that starts with drgero_. It can have scopes, an optional expiration, and an optional dollar budget. Runtime endpoints accept it as:
```http
Authorization: Bearer drgero_...
```
or, where CORS allows it:
```http
X-API-Key: drgero_...
X-Dr.Gero-API-Key: drgero_...
```
Leaderboard
A leaderboard combines a task prompt, dataset, candidate models, evaluation configuration, run history, and the currently selected production model. The selected model can be the ranking winner or a manual override.
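The selection rule described above can be sketched as follows. This is only an illustration: the function name, the `(model_id, score)` row shape, and the assumption that a higher score ranks better are hypothetical and not taken from the Dr.Gero API.

```python
from typing import Optional


def select_production_model(
    ranking: list[tuple[str, float]],
    manual_override: Optional[str] = None,
) -> Optional[str]:
    """Return the manual override if one is set, else the ranking winner.

    Assumption: each ranking row is (model_id, score) and higher is better.
    """
    if manual_override is not None:
        return manual_override
    if not ranking:
        # No run has produced ranking rows yet, so nothing can be selected.
        return None
    return max(ranking, key=lambda row: row[1])[0]
```

Under these assumptions, calling the function without an override returns the top-scoring model, and passing `manual_override` pins the selection regardless of the ranking.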
Challenge prompt
The challenge prompt is the system/task prompt applied to dataset examples and inference requests. If the prompt contains {input}, {question}, or {query}, Dr.Gero replaces the placeholder with the user input. Otherwise, it appends the user input to the end of the prompt.
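A minimal sketch of this placeholder behavior, for illustration only. The function name `render_prompt` is hypothetical, and the blank-line separator used when appending is an assumption, since the exact separator is not documented here.

```python
import re

# The three placeholders named in the text above.
_PLACEHOLDER = re.compile(r"\{(?:input|question|query)\}")


def render_prompt(challenge_prompt: str, user_input: str) -> str:
    """Fill a supported placeholder if present; otherwise append the input."""
    if _PLACEHOLDER.search(challenge_prompt):
        # A callable replacement keeps backslashes in user_input literal and
        # substitutes in a single pass, so the input is never re-scanned.
        return _PLACEHOLDER.sub(lambda _match: user_input, challenge_prompt)
    # Assumption: a blank line separates the prompt from the appended input.
    return f"{challenge_prompt}\n\n{user_input}"
```

For example, `render_prompt("Translate: {input}", "hello")` yields `Translate: hello`, while a prompt with no placeholder gets the input appended after it.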
Dataset modes
| Mode | Description | Typical use |
|---|---|---|
| GET | Read a .jsonl or .jsonl.gz dataset from Hugging Face. | Static benchmark or curated eval set. |
| PUSH | Accept examples through a webhook and periodically consolidate them into a JSONL dataset. | Production feedback loops and trace collection. |
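As a rough illustration of the file formats the GET mode reads, the sketch below parses .jsonl or .jsonl.gz content into records. It assumes the file bytes have already been downloaded; the function name and the `input`/`expected` field names in the usage note are hypothetical, not a documented Dr.Gero schema.

```python
import gzip
import json


def parse_jsonl(data: bytes, filename: str) -> list[dict]:
    """Decode JSONL bytes into records, decompressing .jsonl.gz files first."""
    if filename.endswith(".gz"):
        data = gzip.decompress(data)
    records = []
    for line in data.decode("utf-8").split("\n"):
        line = line.strip()
        if line:  # skip blank lines, including the trailing one
            records.append(json.loads(line))
    return records
```

A file containing one JSON object per line, such as `{"input": "2+2", "expected": "4"}`, parses to one dictionary per example, and the gzipped variant yields the same records.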
Evaluation types
| Type | Description |
|---|---|
| Exact match | Compare model output with expected output exactly or through deterministic matching. |
| Judge | Use a judge model, usually via OpenRouter, to score outputs against a rubric or expected answer. |
| Human | Track manually reviewed results. |
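To make the Exact match type concrete, here is one possible scorer. The normalization rule (lowercasing and collapsing whitespace) is an assumption standing in for "deterministic matching"; the actual rules Dr.Gero applies are not specified in this section, and all names are hypothetical.

```python
from typing import Iterable


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace; one possible deterministic rule."""
    return " ".join(text.lower().split())


def exact_match(output: str, expected: str, strict: bool = False) -> bool:
    """Compare byte-for-byte when strict, otherwise after normalization."""
    if strict:
        return output == expected
    return normalize(output) == normalize(expected)


def accuracy(pairs: Iterable[tuple[str, str]]) -> float:
    """Fraction of (output, expected) pairs that match; 0.0 for no pairs."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(exact_match(out, exp) for out, exp in pairs) / len(pairs)
```

With this rule, `"  Paris "` matches `"paris"` in the default mode but not in strict mode.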
Candidate model
A candidate model is a model endpoint attached to a leaderboard. The endpoint can be an OpenRouter, Custom, Hugging Face, or Dr.Gero model. Leaderboard runs evaluate candidate models and write ranking rows.
Run
A run evaluates selected candidate models against the leaderboard dataset. Runs can be manual, scheduled, or triggered by dataset improvement workflows. Each run stores model configs, leaderboard config, cost, timing, and ranking output.
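The stored run fields listed above could be modeled roughly as below. This is a hypothetical shape for illustration: the class names, field names, units (dollars and seconds), and trigger values are assumptions, not the actual Dr.Gero schema.

```python
import json
from dataclasses import asdict, dataclass, field
from enum import Enum


class RunTrigger(Enum):
    # Assumed labels for the three ways a run can start.
    MANUAL = "manual"
    SCHEDULED = "scheduled"
    DATASET_IMPROVEMENT = "dataset_improvement"


@dataclass
class RunRecord:
    trigger: RunTrigger
    model_configs: list[dict]      # one config per evaluated candidate model
    leaderboard_config: dict       # snapshot of the leaderboard settings
    cost_usd: float = 0.0          # assumed unit: US dollars
    duration_seconds: float = 0.0  # assumed unit: seconds
    ranking: list[dict] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the record, storing the trigger as its string value."""
        payload = asdict(self)
        payload["trigger"] = self.trigger.value
        return json.dumps(payload, sort_keys=True)
```

Snapshotting the model and leaderboard configs on the record is one way to keep a past run interpretable after the leaderboard is later edited.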
Trace
A trace is a JSON record of a run, inference call, dataset event, or manual event. Traces power debugging, auditing, and dataset improvement.
Dr.Gero model
A Dr.Gero model is a workspace model object that can be assigned to leaderboards and fine-tuned. Fine-tune runs can use leaderboard datasets and support schedules.