Core Concepts

Updated on Jun 16, 2026

Observations

The Observations page shows all traces sent by your instrumented applications. A trace represents one end-to-end execution — for example, a user asking your agent a question and receiving a response.

Each trace contains one or more spans. A span is a single unit of work such as an LLM call, a tool invocation, or a custom operation you defined with a decorator.

You can filter observations by time range, application name, tags, and status. Click any trace to open a detailed timeline view showing the full span tree, token counts, latencies, and input/output content.

LLM Requests

The LLM Requests page provides a real-time stream of individual LLM calls across all your instrumented applications. Use it when you need to:

  • Monitor live production traffic
  • Spot errors or slow responses as they happen
  • Inspect the exact prompts and completions for a specific call

Cost Analytics

Cost Analytics dashboard aggregates token usage and estimated costs across your organization. The platform automatically calculates costs based on the model and provider reported in each span.

Use cost analytics to:

  • Track spending trends over time
  • Compare costs across models (for example, GPT-4.1 versus GPT-4.1-mini)
  • Identify which applications or workflows are most expensive
  • Set usage expectations for your team

Evaluation Tasks

Evaluation tasks let you assess the quality of your AI agent's outputs using an LLM-as-a-Judge approach. You define what to evaluate, choose an evaluator template, and the platform runs the evaluation automatically.

There are two types of evaluation tasks:

  • Historical — Evaluate traces that have already been collected. Useful for batch quality checks and regression testing.
  • Real-time — Continuously evaluate new traces as they arrive. Useful for ongoing production monitoring.

To create an evaluation task, use the wizard at Evaluation Tasks > Create New. You will:

  1. Choose a name and data type (Historical or Real-time)
  2. Select which traces to evaluate (by time range, app name, or tags)
  3. Pick an evaluator template (or create a custom one)
  4. Configure evaluator parameters
  5. Run the task

Evaluator Templates

An evaluator template defines how an LLM judge scores your agent's output. Each template contains:

  • A system prompt — instructions for the judge LLM
  • Scoring criteria — what constitutes a good or bad response
  • An LLM integration — which model to use as the judge

The platform includes built-in templates for common quality dimensions such as relevance, helpfulness, and safety. You can also create custom templates tailored to your domain.

Scores

The Scores page shows evaluation results across all your completed and active evaluation tasks. Each score includes the evaluator's verdict, a numeric rating, and an explanation.

Use scores to:

  • Track quality trends over time
  • Compare different agent versions or prompt strategies
  • Identify specific traces that scored poorly for debugging

API Keys

API keys authenticate your SDK instrumentation with the collector. Every trace your application sends must include a valid API key.

To create an API key:

  1. Go to API Keys in the left sidebar
  2. Click Create API Key
  3. Give it a descriptive name (for example, "production-agent-v2")
  4. Copy and store the key securely — it is shown only once

Each API key is scoped to your organization. You can create multiple keys to separate environments (development, staging, production) or teams.

Units

A unit is the basic measure of usage in the Progress Observability Platform.

  • 1 telemetry span = 1 unit
  • 1 evaluation = 2 spans = 2 units

Telemetry spans come from SDK traces such as LLM calls, tool invocations, agents, workflows, and custom spans.

Evaluation tasks use an LLM-as-a-Judge and generate two internal spans per evaluation, which is why each evaluated span consumes 2 units.

LLM Integrations

LLM integrations connect external LLM providers (such as OpenAI, Azure OpenAI, or Anthropic) to the platform for use in evaluation tasks. When you run an evaluation, the platform calls the configured LLM integration to act as the judge.

LLM integrations are only used by the platform for evaluations. Your application's own LLM calls are instrumented by the SDK independently.

See Also