Core Concepts
Observations
The Observations page shows all traces sent by your instrumented applications. A trace represents one end-to-end execution — for example, a user asking your agent a question and receiving a response.
Each trace contains one or more spans. A span is a single unit of work such as an LLM call, a tool invocation, or a custom operation you defined with a decorator.
You can filter observations by time range, application name, tags, and status. Click any trace to open a detailed timeline view showing the full span tree, token counts, latencies, and input/output content.
LLM Requests
The LLM Requests page provides a real-time stream of individual LLM calls across all your instrumented applications. Use it when you need to:
- Monitor live production traffic
- Spot errors or slow responses as they happen
- Inspect the exact prompts and completions for a specific call
Cost Analytics
Cost Analytics dashboard aggregates token usage and estimated costs across your organization. The platform automatically calculates costs based on the model and provider reported in each span.
Use cost analytics to:
- Track spending trends over time
- Compare costs across models (for example, GPT-4.1 versus GPT-4.1-mini)
- Identify which applications or workflows are most expensive
- Set usage expectations for your team
Evaluation Tasks
Evaluation tasks let you assess the quality of your AI agent's outputs using an LLM-as-a-Judge approach. You define what to evaluate, choose an evaluator template, and the platform runs the evaluation automatically.
There are two types of evaluation tasks:
- Historical — Evaluate traces that have already been collected. Useful for batch quality checks and regression testing.
- Real-time — Continuously evaluate new traces as they arrive. Useful for ongoing production monitoring.
To create an evaluation task, use the wizard at Evaluation Tasks > Create New. You will:
- Choose a name and data type (Historical or Real-time)
- Select which traces to evaluate (by time range, app name, or tags)
- Pick an evaluator template (or create a custom one)
- Configure evaluator parameters
- Run the task
Evaluator Templates
An evaluator template defines how an LLM judge scores your agent's output. Each template contains:
- A system prompt — instructions for the judge LLM
- Scoring criteria — what constitutes a good or bad response
- An LLM integration — which model to use as the judge
The platform includes built-in templates for common quality dimensions such as relevance, helpfulness, and safety. You can also create custom templates tailored to your domain.
Scores
The Scores page shows evaluation results across all your completed and active evaluation tasks. Each score includes the evaluator's verdict, a numeric rating, and an explanation.
Use scores to:
- Track quality trends over time
- Compare different agent versions or prompt strategies
- Identify specific traces that scored poorly for debugging
API Keys
API keys authenticate your SDK instrumentation with the collector. Every trace your application sends must include a valid API key.
To create an API key:
- Go to API Keys in the left sidebar
- Click Create API Key
- Give it a descriptive name (for example, "production-agent-v2")
- Copy and store the key securely — it is shown only once
Each API key is scoped to your organization. You can create multiple keys to separate environments (development, staging, production) or teams.
Units
A unit is the basic measure of usage in the Progress Observability Platform.
- 1 telemetry span = 1 unit
- 1 evaluation = 2 spans = 2 units
Telemetry spans come from SDK traces such as LLM calls, tool invocations, agents, workflows, and custom spans.
Evaluation tasks use an LLM-as-a-Judge and generate two internal spans per evaluation, which is why each evaluated span consumes 2 units.
LLM Integrations
LLM integrations connect external LLM providers (such as OpenAI, Azure OpenAI, or Anthropic) to the platform for use in evaluation tasks. When you run an evaluation, the platform calls the configured LLM integration to act as the judge.
LLM integrations are only used by the platform for evaluations. Your application's own LLM calls are instrumented by the SDK independently.