# Running the Demo Benchmark

This guide explains how to run the stakeholder demo benchmark for the Baumbart chatbot from the repository root.

The benchmark runs directly against `@leipzigtreechat/chatbot`, replays a fixed set of German questions and dialogues, captures metadata and transcripts, and writes JSON and Markdown reports.
## What the benchmark does

The benchmark currently includes:

- 12 single-turn example questions
- 6 multi-turn dialogues
- rubric-based scoring instead of exact string matching
- output reports in `packages/chatbot/benchmark-results`

It is meant for demo evaluation and regression tracking, not as a strict CI gate.
## Prerequisites

Before running the benchmark, make sure you have:

- installed the dependencies with `bun install`
- a valid OpenRouter API key
- the local Qanary stack running
- the required environment variables in the root `.env`

For the full local stack setup, see Getting Started.
## Required environment variables

The root benchmark command explicitly loads the repository root `.env`. The following variables must be set there:

```sh
VIRTUOSO_DBA_PASSWORD=your-secure-password
OPENROUTER_API_KEY=sk-or-v1-...
QANARY_API_BASE_URL=http://localhost:8080
TRIPLESTORE_URL=http://localhost:8890/sparql
```

`QANARY_API_BASE_URL` and `TRIPLESTORE_URL` are the same local defaults already used elsewhere in the project.
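Before a run, it can help to verify that all four variables are actually set in the current environment. The snippet below is a sketch, not part of the repository; only the variable names are taken from this guide, the `check_env` helper itself is illustrative.

```sh
#!/usr/bin/env sh
# Sketch: fail fast when one of the benchmark's required variables is unset.
# Variable names come from this guide; the helper itself is illustrative.
check_env() {
  missing=0
  for name in "$@"; do
    eval "value=\${$name:-}"   # portable indirect expansion
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

if check_env VIRTUOSO_DBA_PASSWORD OPENROUTER_API_KEY \
             QANARY_API_BASE_URL TRIPLESTORE_URL; then
  echo "all required variables are set"
else
  echo "fix the root .env before running the benchmark" >&2
fi
```

Running this after `source`-ing the root `.env` (or under `bun --env-file=.env`) catches a missing key before the benchmark spends time starting components.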
## Start the local services

The benchmark is a live-stack benchmark, so the supporting services must be reachable before you run it.

Start the Docker-backed infrastructure first, from the repository root:

```sh
docker compose up -d virtuoso qanary_pipeline leipzig-tree-knowledge-graph
```

The root benchmark command then starts the required local Qanary components automatically.
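Once the containers are up, a quick reachability probe confirms that the two endpoints the benchmark depends on are answering. This is a sketch that requires `curl`; the fallback URLs are the local defaults from this guide, and the `probe` helper is illustrative.

```sh
#!/usr/bin/env sh
# Sketch: probe the two endpoints the benchmark depends on before starting it.
# Requires curl; the default URLs are the local defaults from this guide.
probe() {
  url="$1"; label="$2"
  if curl -fsS --max-time 3 "$url" >/dev/null 2>&1; then
    echo "ok: $label ($url)"
  else
    echo "unreachable: $label ($url)"
  fi
}

probe "${QANARY_API_BASE_URL:-http://localhost:8080}" "Qanary pipeline"
probe "${TRIPLESTORE_URL:-http://localhost:8890/sparql}" "Virtuoso SPARQL endpoint"
```

If either line reports `unreachable`, check `docker compose ps` and the container logs before blaming the benchmark itself.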
## Run the benchmark

From the repository root:

```sh
bun run benchmark:demo
```

The root script currently does the following:

- loads the root `.env`
- starts the five required Qanary components through Turbo
- waits until each component reports that it started
- waits until each component reports that it was registered in the Qanary pipeline
- runs the package benchmark
- shuts the component dev process down again
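The two "wait" steps above can be pictured as polling the component output for a marker line. The sketch below illustrates the idea only; the log path and marker strings are hypothetical, and the real wrapper may use a different mechanism entirely.

```sh
#!/usr/bin/env sh
# Sketch: poll a file until a marker line appears, or give up after a timeout.
# The log path and marker strings the real wrapper uses are assumptions.
wait_for_line() {
  file="$1"; pattern="$2"; tries="${3:-30}"
  while [ "$tries" -gt 0 ]; do
    if grep -q "$pattern" "$file" 2>/dev/null; then
      return 0
    fi
    tries=$((tries - 1))
    sleep 1
  done
  return 1
}

# Hypothetical usage against a component's captured output:
# wait_for_line component.log "Started" && wait_for_line component.log "registered"
```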
You can still run the package-level benchmark directly if the components are already running:

```sh
bun --env-file=.env run --filter @leipzigtreechat/chatbot benchmark:demo
```

## Benchmark output

Each run writes four files under `packages/chatbot/benchmark-results`:

- a timestamped JSON report
- a timestamped Markdown report
- `latest.json`
- `latest.md`

Example:

```text
packages/chatbot/benchmark-results/demo-benchmark-2026-03-22T12-49-11.637Z.json
packages/chatbot/benchmark-results/demo-benchmark-2026-03-22T12-49-11.637Z.md
packages/chatbot/benchmark-results/latest.json
packages/chatbot/benchmark-results/latest.md
```

The Markdown report includes:
- total case count
- pass rate
- clarification success rate
- graceful-failure success rate
- median case duration
- recommended demo subset
- grouped passed, soft-failed, and failed cases
- full transcripts for dialogue cases
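When you only want the headline numbers, grepping `latest.md` is usually enough. A sketch; the grep patterns assume the metric names listed above appear verbatim in the report, so adjust them to match your generated file.

```sh
#!/usr/bin/env sh
# Sketch: print only the headline metric lines from the latest Markdown report.
# The grep patterns assume the metric names listed above appear verbatim.
summarize_report() {
  grep -iE 'pass rate|success rate|median' "$1" 2>/dev/null \
    || echo "no metric lines found in $1"
}

summarize_report packages/chatbot/benchmark-results/latest.md
```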
## Interpreting failures

Common failure patterns:

- Missing `TRIPLESTORE_URL`: the benchmark cannot query the Qanary annotations graph and fails very early.
- Missing `QANARY_API_BASE_URL`: the benchmark cannot call the Qanary pipeline.
- Missing `OPENROUTER_API_KEY`: question rewriting or response generation fails in the LLM-backed steps.
- Docker services not running: Virtuoso or the Qanary pipeline is unreachable.
- Missing component `.env` files: the root benchmark wrapper refuses to start if one of the local Qanary component env files does not exist.
- Component registration does not succeed: a Qanary component may start locally but still fail to register in the pipeline if `SPRING_BOOT_ADMIN_URL`, `QANARY_HOST`, or the Docker networking is wrong.
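The component `.env` failure mode can be checked up front. The sketch below reports every component directory that is missing its `.env`; the glob passed to the helper is a guess at this repository's layout, so substitute your actual component directories.

```sh
#!/usr/bin/env sh
# Sketch: report every component directory that is missing its .env file.
# The glob below is a guess at the layout; pass your actual component dirs.
check_component_envs() {
  status=0
  for dir in "$@"; do
    if [ ! -f "$dir/.env" ]; then
      echo "missing: $dir/.env" >&2
      status=1
    fi
  done
  return "$status"
}

check_component_envs packages/qanary-components/* \
  || echo "create the missing component .env files before running the wrapper"
```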
If the benchmark starts but reports poor results across all cases, open `packages/chatbot/benchmark-results/latest.md` first. It is the fastest way to see whether the problem is:
- environment setup
- pipeline connectivity
- LLM configuration
- genuine chatbot quality
## Notes

- The benchmark tolerates paraphrased wording and does not require exact output strings.
- The benchmark currently runs against the live runtime, so results depend on data availability, model behavior, and local infrastructure health.
- Re-running the benchmark updates `latest.json` and `latest.md`.