Inference Benchmarking And Cost Model

LLM benchmarks are useful only when they match the request mix you expect in production. Measure latency phases and unit economics, not only headline tokens per second.

Benchmark dimensions

| Dimension | Why it matters |
| --- | --- |
| Prompt tokens | Prefill time and memory pressure grow with input size. |
| Output tokens | Decode duration drives user wait and GPU occupancy. |
| Concurrency | Batching efficiency and queue wait change under load. |
| Model version | Memory profile, context length, and throughput can shift between revisions. |
| Route class | Interactive, agentic, and batch traffic need different SLOs. |
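
One way to keep these dimensions explicit is to enumerate them as a benchmark matrix, so every run records the combination it measured. The sketch below is a minimal illustration, not a prescribed harness: `BenchmarkCase`, `build_matrix`, and all the sample values (token counts, `"model-v1"`, route names) are hypothetical placeholders.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Sequence


@dataclass(frozen=True)
class BenchmarkCase:
    # One point in the benchmark matrix; field names mirror the dimensions above.
    prompt_tokens: int
    output_tokens: int
    concurrency: int
    model_version: str
    route_class: str


def build_matrix(
    prompt_sizes: Sequence[int],
    output_sizes: Sequence[int],
    concurrencies: Sequence[int],
    model_versions: Sequence[str],
    route_classes: Sequence[str],
) -> List[BenchmarkCase]:
    # Full cross product: every value of each dimension paired with every other.
    combos = product(prompt_sizes, output_sizes, concurrencies, model_versions, route_classes)
    return [BenchmarkCase(*combo) for combo in combos]


# Illustrative values only; replace them with the request mix expected in production.
cases = build_matrix(
    prompt_sizes=[512, 4096],
    output_sizes=[128, 1024],
    concurrencies=[1, 16],
    model_versions=["model-v1"],
    route_classes=["interactive", "batch"],
)
print(len(cases))
```

A full cross product grows quickly, so in practice it may be worth pruning combinations that never occur in real traffic.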

Required metrics

  • Time to first token.
  • Inter-token latency and output tokens per second.
  • Queue wait before runtime execution.
  • Prompt tokens, completion tokens, and total tokens.
  • GPU memory used, GPU utilization, and cache pressure.
  • Error rate by model, route, and request class.
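
The latency metrics in this list can be derived from per-request timestamps. The following is a minimal sketch under stated assumptions: `RequestTrace` is a hypothetical record, all timestamps come from one monotonic clock, and time to first token is measured from enqueue (so it includes queue wait). Some teams measure it from runtime start instead; whichever definition is used should be stated with the results.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RequestTrace:
    # Hypothetical trace record; timestamps are seconds on a single monotonic clock.
    enqueued_at: float        # request accepted by the serving layer
    started_at: float         # runtime begins executing the request
    token_times: List[float]  # arrival time of each output token, in order


def latency_metrics(trace: RequestTrace) -> Dict[str, Optional[float]]:
    if not trace.token_times:
        raise ValueError("trace has no output tokens")
    first, last = trace.token_times[0], trace.token_times[-1]
    gaps = len(trace.token_times) - 1  # number of intervals between tokens
    decode_seconds = last - first
    return {
        "queue_wait_s": trace.started_at - trace.enqueued_at,
        # Assumption: TTFT is measured from enqueue, so it includes queue wait.
        "ttft_s": first - trace.enqueued_at,
        # Mean gap between consecutive tokens; undefined for a single token.
        "inter_token_latency_s": decode_seconds / gaps if gaps else None,
        # Decode-phase rate, excluding the first token.
        "output_tokens_per_s": gaps / decode_seconds if gaps and decode_seconds > 0 else None,
    }


trace = RequestTrace(enqueued_at=0.0, started_at=0.5, token_times=[1.0, 1.25, 1.5, 1.75, 2.0])
metrics = latency_metrics(trace)
print(metrics)
```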

Cost model

Start with a simple unit model:

request_cost = accelerator_hourly_cost * request_gpu_seconds / 3600
             + platform_overhead
             + storage_and_retrieval_cost
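
As a worked example, the unit model can be written as a small function. This is a sketch only: the dollar figures are made up for illustration and are not real prices, and it assumes `request_gpu_seconds` is already the share of accelerator time attributed to a single request (under batching, that attribution is itself a modeling choice).

```python
def request_cost(
    accelerator_hourly_cost: float,
    request_gpu_seconds: float,
    platform_overhead: float = 0.0,
    storage_and_retrieval_cost: float = 0.0,
) -> float:
    # Convert the hourly accelerator price to a per-second price, then scale by GPU time.
    compute_cost = accelerator_hourly_cost * request_gpu_seconds / 3600
    return compute_cost + platform_overhead + storage_and_retrieval_cost


# Illustrative numbers: a 3.60/hour accelerator, 2 GPU-seconds per request,
# plus small per-request overhead and retrieval charges.
cost = request_cost(
    accelerator_hourly_cost=3.60,
    request_gpu_seconds=2.0,
    platform_overhead=0.0005,
    storage_and_retrieval_cost=0.0001,
)
print(f"{cost:.4f}")
```

With these inputs the compute term is 3.60 * 2 / 3600 = 0.002, so the total comes to about 0.0026 per request.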

Then segment by request class:

| Request class | Cost concern |
| --- | --- |
| Interactive chat | Warm capacity and tail latency. |
| Agent workflows | Multi-call amplification and tool latency. |
| RAG | Retrieval, reranking, context size, and generation. |
| Batch inference | Throughput per GPU hour and deadline windows. |
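
Segmenting can be as simple as grouping per-request costs by class before averaging, so that a cheap batch tier does not hide an expensive interactive tier. A minimal sketch, assuming each record is a hypothetical `(request_class, cost)` pair with costs already computed from the unit model; the sample values are illustrative:

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def cost_by_class(records: Iterable[Tuple[str, float]]) -> Dict[str, Dict[str, float]]:
    # Group per-request costs by request class, then summarize each group.
    grouped: Dict[str, List[float]] = defaultdict(list)
    for request_class, cost in records:
        grouped[request_class].append(cost)
    return {
        cls: {
            "requests": len(costs),
            "mean_cost": mean(costs),
            "total_cost": sum(costs),
        }
        for cls, costs in grouped.items()
    }


# Illustrative records only. For agent workflows, one user task may contribute
# several records, which is how multi-call amplification shows up in the totals.
records = [
    ("interactive", 0.002),
    ("interactive", 0.004),
    ("agent", 0.003),
    ("agent", 0.003),
    ("agent", 0.003),
    ("batch", 0.001),
]
summary = cost_by_class(records)
for cls, stats in summary.items():
    print(cls, stats)
```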

Acceptance criteria

A benchmark is publishable when it includes workload shape, model version, hardware profile, concurrency, request distribution, latency percentiles, and cost per request class.
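
The criteria above can be enforced mechanically before a report is shared. The sketch below assumes a report is a plain dictionary; the field names are hypothetical and simply mirror the list in the sentence above.

```python
from typing import Any, Dict, List

# Hypothetical field names, one per item in the acceptance criteria.
REQUIRED_FIELDS = (
    "workload_shape",
    "model_version",
    "hardware_profile",
    "concurrency",
    "request_distribution",
    "latency_percentiles",
    "cost_per_request_class",
)


def is_blank(value: Any) -> bool:
    # Treat None and empty containers or strings as missing; 0 is a valid value.
    return value is None or (hasattr(value, "__len__") and len(value) == 0)


def missing_fields(report: Dict[str, Any]) -> List[str]:
    return [name for name in REQUIRED_FIELDS if is_blank(report.get(name))]


def is_publishable(report: Dict[str, Any]) -> bool:
    return not missing_fields(report)


# Illustrative draft report that is missing three required fields.
draft = {
    "workload_shape": "chat, short prompts",
    "model_version": "model-v1",
    "hardware_profile": "",
    "concurrency": 16,
    "latency_percentiles": {"p50_s": 0.4, "p99_s": 1.8},
}
print(missing_fields(draft))
```

A check like this only confirms that each field is present; whether the recorded workload actually matches production traffic still needs human review.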