How to Measure AI Voice Agent Performance: Complete Framework

Last updated June 2026

How to Measure AI Voice Agent Performance: A Complete Framework for Revenue Teams

You deployed an AI voice agent. Calls are happening. But are they working? Without a measurement framework, you're left counting call volume and hoping the pipeline shows up. This guide walks through the metrics, tools, and cadence revenue teams need to evaluate AI voice agent performance — from operational health to revenue impact.

Whether you're running inbound speed-to-lead agents, re-engagement campaigns, or appointment-setting workflows, the same principle applies: a completed call is not the same as a successful outcome. You need to measure what matters for your funnel, not just what's easy to count.

What you'll learn

Which metrics matter for AI voice agents — and which ones are vanity
How to set up measurement infrastructure using Thoughtly Analytics, History, Automations, and CRM sync
How to build a weekly review cadence that catches problems early
How to attribute pipeline to your AI agents and prove ROI
Common measurement mistakes that lead to false conclusions

What you'll need

A deployed Thoughtly voice agent with at least 50 completed calls
Access to Thoughtly Analytics and History in your workspace
An Automation with the On Call Completed trigger configured for post-call data capture
A CRM (HubSpot, Salesforce, or similar) where call outcomes can be tracked against pipeline
A spreadsheet or BI tool for weekly reporting (Google Sheets, Smartsheet, or Looker Studio work well)

Step 1: Define your outcome taxonomy before measuring anything

Before you can measure performance, you need to know what "good" looks like. Define the outcomes that matter for your specific funnel. In Thoughtly, outcomes are the branching labels that route conversations after each caller response — but at the call level, you also have dispositions that auto-tag calls based on transcript analysis.

For a typical inbound lead conversion funnel (insurance, mortgage, education, healthcare, home services), your outcome taxonomy should include:

Booked — the lead scheduled an appointment or consultation
Qualified — the lead met your fit criteria and is ready for human handoff
Not qualified — the lead doesn't meet fit criteria (wrong location, wrong product, not eligible)
Callback requested — the lead asked to be called back later
Voicemail — the call reached voicemail (inbound) or left voicemail (outbound)
No answer — the call was not connected
Transferred — the call was escalated to a human rep

The key decision: what counts as a "successful" call? For most lead conversion workflows, booked and qualified are your success outcomes. Everything else is either a process outcome (callback requested, transferred) or a non-outcome (no answer, voicemail, not qualified). This distinction drives every downstream metric.
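One way to make this taxonomy explicit in reporting code is a small enumeration plus a classifier. This is a minimal sketch: the string labels (such as `not_qualified`) and the names `Outcome` and `outcome_class` are assumptions for illustration, not Thoughtly identifiers, so map them to whatever labels your agent actually emits.

```python
from enum import Enum


class Outcome(str, Enum):
    # Labels are illustrative; align them with your own agent's outcomes.
    BOOKED = "booked"
    QUALIFIED = "qualified"
    NOT_QUALIFIED = "not_qualified"
    CALLBACK_REQUESTED = "callback_requested"
    VOICEMAIL = "voicemail"
    NO_ANSWER = "no_answer"
    TRANSFERRED = "transferred"


# Success outcomes count toward conversion metrics.
SUCCESS = {Outcome.BOOKED, Outcome.QUALIFIED}
# Process outcomes mean the conversation continues elsewhere.
PROCESS = {Outcome.CALLBACK_REQUESTED, Outcome.TRANSFERRED}


def outcome_class(outcome: Outcome) -> str:
    """Classify an outcome as 'success', 'process', or 'non_outcome'."""
    if outcome in SUCCESS:
        return "success"
    if outcome in PROCESS:
        return "process"
    return "non_outcome"
```

Keeping the success set in one place means every downstream rate (qualification, booking, cost per qualified lead) uses the same definition.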

Step 2: Set up your measurement infrastructure in Thoughtly

Thoughtly gives you three layers of measurement: Analytics for aggregate dashboards, History for per-call detail, and Automations for piping call data into your CRM or reporting tools. Here's how to wire them together.

Configure the On Call Completed trigger

In Thoughtly Automations, the On Call Completed trigger fires after every call ends. It outputs rich call data: durations, outcomes, variables captured, action flags, transfers, and voicemail detection. This is your primary data pipeline for measurement.

To set it up:

Go to Automations in Thoughtly and create a new automation
Select Thoughtly → On Call Completed as the trigger
Choose the agent scope: specific agents or all agents
Add downstream steps: send a webhook, update your CRM, write to Google Sheets, or push to Slack

The trigger payload includes the full transcript (as a structured array with speaker, timestamp, and node information), call duration, outcome, variables, and system metadata. Here's the shape of what you'll receive:

```json
{
  "call_id": "interview_response_id",
  "agent_id": "agent_id",
  "duration": 142,
  "outcome": "qualified",
  "transcript": [
    {
      "transcript": "Hello, how can I help you today?",
      "speaker": "ai",
      "createdAt": "2025-11-03T19:33:18.330Z",
      "step": 1,
      "node_id": "node_abc123"
    },
    {
      "transcript": "I'd like to get a quote",
      "speaker": "user",
      "createdAt": "2025-11-03T19:33:25.120Z"
    }
  ],
  "variables": {
    "phone": "+1234567890",
    "intent": "quote_request",
    "location": "Austin, TX"
  }
}
```

Example On Call Completed payload (simplified)
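If your automation forwards this payload to a webhook you control, a small function can flatten it into one row per call for a spreadsheet or CRM. The sketch below assumes only the fields shown in the simplified example above; `flatten_call` and the `var_` prefix are hypothetical names, and the real payload may contain more fields.

```python
def flatten_call(payload: dict) -> dict:
    """Turn one On Call Completed payload into a flat reporting row."""
    turns = payload.get("transcript", [])
    row = {
        "call_id": payload.get("call_id"),
        "agent_id": payload.get("agent_id"),
        # Kept as delivered; the sample payload does not state the unit.
        "duration": payload.get("duration"),
        "outcome": payload.get("outcome"),
        "total_turns": len(turns),
        "user_turns": sum(1 for t in turns if t.get("speaker") == "user"),
    }
    # Prefix captured variables so they cannot collide with core fields.
    for name, value in payload.get("variables", {}).items():
        row[f"var_{name}"] = value
    return row
```

Using `.get` throughout keeps the function tolerant of calls where a field is missing, for example a no-answer call with an empty transcript.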

Enable dispositions for auto-tagging

Thoughtly's disposition feature automatically tags calls based on transcript content and call outcome. Labels like "Qualified lead", "No answer", "Left voicemail", or "Request callback" are applied the moment a call ends. These tags surface in History and can be used for filtering and reporting.

Note: dispositions are the legacy post-call setting. Thoughtly recommends migrating to Automations with On Call Completed for more powerful conditional logic, CRM updates, and multi-step workflows. But dispositions still work for basic call categorization in your call history.

Use variables for structured data capture

Thoughtly variables extract structured data from the conversation — caller intent, location, eligibility, availability — right after the caller replies and before outcomes evaluate. These variables are included in the On Call Completed payload, so you can pipe them into your CRM fields or reporting spreadsheet for measurement.

For example, if your agent captures intent, location, and preferred_time as variables, you can track qualification rates by location or booking rates by intent — not just aggregate call counts.
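As an illustration of slicing outcome rates by a captured variable, the sketch below groups call records by one variable and computes the share that ended in a success outcome. It assumes records shaped like the On Call Completed payload (an `outcome` string and a `variables` object); `rate_by_variable` and the default success labels are hypothetical.

```python
from collections import defaultdict


def rate_by_variable(calls, variable, success_outcomes=("qualified", "booked")):
    """Return {variable value: success rate} across the given call records."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for call in calls:
        # Calls where the variable was never captured fall into "unknown".
        key = call.get("variables", {}).get(variable, "unknown")
        totals[key] += 1
        if call.get("outcome") in success_outcomes:
            hits[key] += 1
    return {key: hits[key] / totals[key] for key in totals}
```

Calling `rate_by_variable(calls, "location")` gives a qualification rate per location; passing `"intent"` gives the same view per intent. Small groups produce noisy rates, so check the call count behind each value before acting on it.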

Step 3: Track the metrics that actually matter

Not all metrics are created equal. Here's a framework organized into four tiers: operational health, conversation quality, funnel impact, and conversation intelligence. Each tier answers a different question.

Tier 1: Operational health (is the agent functioning?)

Metric	What it measures	Where to find it	Healthy range
Contact rate	% of outbound calls that connect	History (filter by status)	≥ 35% (varies by vertical)
Answer rate (inbound)	% of inbound calls answered by the agent	History / Analytics	≥ 95%
Call duration	Average minutes per call	Analytics → Talk Time	2–6 min (qualification); 5–12 min (booking)
Voicemail detection rate	% of outbound calls that reach voicemail	History (filter: Left Voicemail)	Context-dependent; compare to human dialing
Error / failed rate	% of calls with errors or failed connections	History (filter: Failed)	≤ 3%

These metrics tell you whether your agent is technically working. If contact rate drops below 20% or error rate spikes above 5%, you likely have an infrastructure, carrier, or list-quality problem to solve before anything else matters.
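The thresholds above can be turned into a simple weekly check. This sketch uses the healthy ranges from the table (contact rate at or above 35%, error rate at or below 3%) and the alert levels from the paragraph (below 20%, above 5%). The intermediate "watch" band and all function names are assumptions added for illustration.

```python
def contact_rate_status(rate: float) -> str:
    """Classify an outbound contact rate given as a fraction (0.35 = 35%)."""
    if rate >= 0.35:
        return "healthy"
    if rate >= 0.20:
        return "watch"  # assumed band between the healthy range and the alert level
    return "alert"


def error_rate_status(rate: float) -> str:
    """Classify an error / failed rate given as a fraction (0.03 = 3%)."""
    if rate <= 0.03:
        return "healthy"
    if rate <= 0.05:
        return "watch"  # assumed band between the healthy range and the alert level
    return "alert"


def weekly_health(dialed: int, connected: int, total_calls: int, failed: int) -> dict:
    """Compute both rates from raw counts and attach a status to each."""
    if dialed <= 0 or total_calls <= 0:
        raise ValueError("need at least one dialed call and one total call")
    contact_rate = connected / dialed
    error_rate = failed / total_calls
    return {
        "contact_rate": contact_rate,
        "contact_status": contact_rate_status(contact_rate),
        "error_rate": error_rate,
        "error_status": error_rate_status(error_rate),
    }
```

Because healthy contact rates vary by vertical, treat the 35% cutoff as a starting point and adjust it to your own baseline.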

Tier 2: Conversation quality (is the agent performing well?)

Metric	What it measures	Where to find it	Healthy range
Outcome accuracy	% of calls where the AI outcome matches human review	Manual review in History	≥ 85%
Transfer rate	% of calls escalated to a human	History (filter: Transferred)	10–25% (depends on use case)
Qualification rate	% of connected calls that result in a qualified lead	History (filter by disposition)	30–60% (varies by vertical)
Booking rate	% of connected calls that result in an appointment	History / CRM cross-reference	15–35% (depends on funnel)
Repeat callback rate	% of leads requiring multiple calls to resolve	CRM or contact history	Lower is better; track trend

Conversation quality metrics tell you whether the agent is actually doing its job — qualifying, booking, and routing leads effectively. A high transfer rate isn't inherently bad (some workflows are designed to transfer qualified leads), but if it's climbing over 40%, your agent may be struggling to handle objections or complete tasks autonomously.
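Outcome accuracy is the one metric in this tier that needs a human in the loop. A minimal sketch of the calculation, assuming you record each manual review as a pair of (AI outcome, reviewer outcome); the function names and labels are hypothetical:

```python
from collections import Counter


def outcome_accuracy(reviews) -> float:
    """Share of reviewed calls where the AI outcome matches the reviewer's label."""
    if not reviews:
        raise ValueError("no reviewed calls")
    matches = sum(1 for ai, human in reviews if ai == human)
    return matches / len(reviews)


def disagreements(reviews):
    """Count each (AI outcome, reviewer outcome) mismatch, most frequent first."""
    return Counter((ai, human) for ai, human in reviews if ai != human).most_common()
```

The disagreement counts tend to be more actionable than the accuracy figure itself: a recurring pair such as ("qualified", "not_qualified") points to a specific outcome definition that needs tightening.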

Tier 3: Funnel impact (is the agent driving revenue?)

Metric	What it measures	Where to find it	Why it matters
Lead coverage rate	% of inbound leads contacted by the agent	CRM lead count vs. Thoughtly call log	Measures whether you're working 100% of leads
Speed-to-lead	Minutes between lead arrival and first contact	Thoughtly call log + CRM timestamp	Sub-10-minute response doubles conversion odds
Pipeline generated	Dollar value of opportunities created from agent-sourced calls	CRM pipeline report filtered by source	The revenue metric that matters to leadership
Cost per qualified lead	Total Thoughtly cost ÷ number of qualified leads	Billing + outcome data	Compare to CAC benchmarks for your vertical
Agent vs. human lift	Conversion rate of AI-contacted leads vs. human-only leads	A/B test or cohort comparison	Proves the agent adds incremental value

Funnel impact metrics are where you prove ROI. The two that matter most: lead coverage (are you working every lead?) and agent vs. human lift (is the agent converting more than your human team would have on its own?). Thoughtly's Analytics dashboard includes agent-vs-human lift measurement so you can quantify exactly how much pipeline the agent added.
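Lead coverage and speed-to-lead both come from joining CRM lead timestamps with the first call per lead. The sketch below assumes you have already built two dictionaries keyed by a shared lead identifier (how you obtain that identifier depends on your CRM sync); all names are illustrative.

```python
from datetime import datetime
from statistics import median


def lead_coverage(lead_created: dict, first_call: dict) -> float:
    """Fraction of leads in the period that received at least one call."""
    if not lead_created:
        raise ValueError("no leads in the period")
    contacted = sum(1 for lead_id in lead_created if lead_id in first_call)
    return contacted / len(lead_created)


def median_speed_to_lead_minutes(lead_created: dict, first_call: dict):
    """Median minutes from lead creation to first call, over contacted leads."""
    waits = [
        (first_call[lead_id] - created).total_seconds() / 60
        for lead_id, created in lead_created.items()
        if lead_id in first_call
    ]
    return median(waits) if waits else None


# Example: three leads, two of them contacted.
leads = {
    "a": datetime(2025, 11, 3, 19, 0),
    "b": datetime(2025, 11, 3, 19, 10),
    "c": datetime(2025, 11, 3, 19, 20),
}
calls = {
    "a": datetime(2025, 11, 3, 19, 2),
    "b": datetime(2025, 11, 3, 19, 22),
}
coverage = lead_coverage(leads, calls)
speed = median_speed_to_lead_minutes(leads, calls)
```

The median is used instead of the mean so that a handful of leads that waited overnight do not hide an otherwise fast response time.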

Tier 4: Conversation intelligence (what are leads telling you?)

Beyond quantitative metrics, Thoughtly scores every call on intent, sentiment, and objection type. This conversation quality data is available in the Analytics dashboard and in the call transcript within History. Use it to identify:

Objection patterns — if 30% of qualified leads mention pricing concerns, your agent needs a better pricing response (or your pricing needs adjusting)
Intent distribution — what are callers actually asking for? This should inform your agent's prompt and your marketing strategy
Sentiment trends — are callers getting frustrated at specific points in the conversation? Look for nodes where sentiment drops

Step 4: Build a weekly review cadence

Metrics are useless if nobody looks at them. Set up a weekly review that takes 30 minutes and covers operational health, conversation quality, and funnel impact. Here's a proven structure:

Monday: operational health check (10 minutes)

Open Analytics and review the previous week's Responses, Talk Time, and Usage by agent
Filter History by status: check for Failed, Busy, or No Answer spikes
Verify your On Call Completed automation is firing — check your CRM or webhook logs for recent call records

Wednesday: conversation quality review (10 minutes)

Filter History by outcome: review at least 5 qualified calls
Listen for objection patterns, awkward pauses, or places where the agent misunderstood the caller
Check disposition accuracy — do the auto-tags match what actually happened in the call?

Friday: funnel impact report (10 minutes)

Pull your CRM pipeline report filtered by AI-sourced leads for the week
Compare to the previous week: is pipeline trending up or down?
Calculate cost per qualified lead for the week (Thoughtly billing ÷ qualified lead count)
Share a summary in Slack using a Thoughtly Automation that posts a daily call digest: total calls, conversion rate, hot leads, escalations

This digest is especially useful for RevOps and sales leaders. You can build it with a Thoughtly Automation that uses the On Call Completed trigger, aggregates results, and posts a structured summary to your team channel at end of day.
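If you aggregate call records outside Thoughtly (for example in a script fed by a webhook), the digest text itself is simple to assemble. This is a sketch under assumptions: "hot leads" is taken to mean booked plus qualified calls, "escalations" to mean transferred calls, and the outcome labels and `build_digest` name are hypothetical. Posting the text to Slack is left to your automation or webhook step.

```python
from collections import Counter


def build_digest(calls, day: str) -> str:
    """Build a plain-text call digest from a list of call records."""
    counts = Counter(call.get("outcome") for call in calls)
    total = len(calls)
    hot_leads = counts["booked"] + counts["qualified"]  # assumed definition
    conversion = hot_leads / total if total else 0.0
    lines = [
        f"Call digest for {day}",
        f"Total calls: {total}",
        f"Conversion rate: {conversion:.0%}",
        f"Hot leads: {hot_leads}",
        f"Escalations: {counts['transferred']}",
    ]
    return "\n".join(lines)
```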

Step 5: Attribute pipeline to your AI agents

Attribution is the hardest part of measurement — and the most important. Here's a practical approach that works for high-volume lead conversion funnels:

Source tagging

Every call made by a Thoughtly agent should write a call_source or origin attribute to the CRM contact record. Use Thoughtly's CRM integrations (HubSpot, Salesforce, Pipedrive) to automatically write call activity, outcomes, and recordings to the contact. This creates an auditable trail from call to pipeline.

Cohort comparison

The cleanest way to prove agent lift is a cohort comparison:

Group A: leads handled by the AI agent (speed-to-lead within minutes)
Group B: leads handled by human reps only (standard response time)
Compare booking rate, qualification rate, and time-to-close between the two cohorts over 30–60 days

If you can't run a controlled A/B test, use a pre/post comparison: measure your team's booking rate and pipeline velocity for 30 days before deploying the agent, then compare to the 30 days after. The delta — adjusted for lead volume changes — is your agent lift.
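To judge whether a difference between the two cohorts is more than noise, one common approach is a two-proportion z-test on the booking (or qualification) rate. This is a sketch of that standard test, not a Thoughtly feature; the function and field names are illustrative, and it assumes the cohorts are independent and reasonably large.

```python
from math import erf, sqrt


def compare_cohorts(agent_success: int, agent_total: int,
                    human_success: int, human_total: int) -> dict:
    """Compare two conversion rates with a pooled two-proportion z-test."""
    p_agent = agent_success / agent_total
    p_human = human_success / human_total
    pooled = (agent_success + human_success) / (agent_total + human_total)
    std_err = sqrt(pooled * (1 - pooled) * (1 / agent_total + 1 / human_total))
    z = (p_agent - p_human) / std_err
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return {
        "agent_rate": p_agent,
        "human_rate": p_human,
        "lift_points": p_agent - p_human,            # absolute difference
        "relative_lift": (p_agent - p_human) / p_human,
        "z": z,
        "p_value": p_value,
    }


# Example: 60 bookings from 200 agent-handled leads vs. 40 from 200 human-only leads.
result = compare_cohorts(60, 200, 40, 200)
```

In this example the agent cohort books at 30% against 20% for the human-only cohort, a 10 point absolute lift. A pre/post comparison can reuse the same function, but the result is weaker evidence because lead mix and seasonality may have changed between the two periods.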

CAC payback

Thoughtly's Analytics dashboard includes CAC payback attribution to the channel and campaign level. This means you can see not just "did the agent book meetings" but "which lead sources produced the highest-value agent-sourced pipeline." Use this to reallocate marketing spend toward the channels that produce the most agent-convertible leads.

Step 6: Export data for deeper analysis

Thoughtly's built-in Analytics and History cover most day-to-day measurement needs. For deeper analysis — cohort tracking, multi-touch attribution, or executive reporting — export History data and combine it with your CRM pipeline report.

To export from History:

Apply the filters you want (agent, date range, status, disposition)
Click Export in the History page
Wait for the export to finish and download the file
Import into Google Sheets, Excel, or your BI tool

For automated reporting, pipe Thoughtly call outputs directly into Google Sheets or Smartsheet using Thoughtly Automations. This gives you a live dashboard of call volume, outcomes, and qualified leads that updates after every call, with no manual export needed.
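Once you have a History export and a CRM pipeline report side by side, combining them is a small join. The sketch below is illustrative only: it assumes both files are CSV and share a `phone` column, that the export has a `disposition` column, and that the CRM report has an `amount` column. Check the actual column names in your export before adapting it.

```python
import csv
import io


def pipeline_by_disposition(history_csv: str, crm_csv: str) -> dict:
    """Sum CRM opportunity amounts per call disposition, joined on phone."""
    # Total opportunity amount per phone number from the CRM report.
    amount_by_phone = {}
    for row in csv.DictReader(io.StringIO(crm_csv)):
        phone = row["phone"]
        amount_by_phone[phone] = amount_by_phone.get(phone, 0.0) + float(row["amount"])

    # Keep one disposition per phone (the last row wins) so a lead that was
    # called several times is not counted more than once.
    disposition_by_phone = {}
    for row in csv.DictReader(io.StringIO(history_csv)):
        disposition_by_phone[row["phone"]] = row["disposition"]

    totals = {}
    for phone, disposition in disposition_by_phone.items():
        totals[disposition] = totals.get(disposition, 0.0) + amount_by_phone.get(phone, 0.0)
    return totals
```

Phone numbers need to be in the same format on both sides (for example E.164) for the join to match; normalize them first if your CRM stores them differently.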

Common mistakes

Counting calls instead of outcomes

Call volume is the most misleading metric in AI voice. An agent that makes 500 calls but books 2 meetings is underperforming. Always pair volume with outcome rates. The metric that matters is qualified leads per 100 calls — not total calls dialed.

Ignoring contact rate context

If your outbound contact rate is 15%, your agent isn't broken — your lead list quality or calling windows might be. Before blaming the agent, check: are the numbers valid? Are you calling at the right time of day? Is the caller ID flagged as spam? Use Thoughtly's branded calling feature to improve pickup rates on outbound.

Not reviewing transcripts

Numbers tell you what happened. Transcripts tell you why. If your qualification rate dropped from 45% to 30% this week, the transcript is where you'll find the answer. Maybe the agent's prompt changed, maybe the lead mix shifted, or maybe a new objection pattern emerged that the agent isn't handling well. Review at least 5 transcripts per week.

Measuring in isolation

AI agent performance doesn't exist in a vacuum. If your agent books 40 meetings but your sales team only closes 2, the issue might be downstream — not with the agent. Track the full funnel: call → qualified → booked → showed up → closed. The agent's job is to produce qualified, booked leads. The sales team's job is to close them. Measure both.
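A stage-by-stage conversion table makes it easier to see where the funnel leaks. The sketch below takes ordered stage counts and returns the conversion rate of each step; the counts in the example are made up to mirror the scenario above (40 booked, 2 closed), and the function names are illustrative.

```python
def funnel_conversion(stage_counts):
    """Return [(step label, conversion rate)] for each adjacent pair of stages."""
    steps = []
    for (prev_stage, prev_count), (stage, count) in zip(stage_counts, stage_counts[1:]):
        rate = count / prev_count if prev_count else 0.0
        steps.append((f"{prev_stage} -> {stage}", rate))
    return steps


def weakest_step(stage_counts):
    """Return the step with the lowest conversion rate."""
    return min(funnel_conversion(stage_counts), key=lambda step: step[1])


funnel = [
    ("call", 500),
    ("qualified", 150),
    ("booked", 40),
    ("showed up", 30),
    ("closed", 2),
]
steps = funnel_conversion(funnel)
worst = weakest_step(funnel)
```

Here the weakest step is "showed up -> closed", which sits with the sales team and not the agent. The lowest rate is only a starting point for investigation, since healthy conversion differs from stage to stage.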

Over-optimizing for low transfer rates

A low transfer rate sounds good — the agent is handling everything! But if your workflow is designed to transfer hot leads to humans (and it should be), a very low transfer rate might mean the agent is trying to do too much. Define what should be transferred (hot leads, complex cases, booking confirmations) and measure against that expectation, not against an arbitrary "lower is better" benchmark.

Measuring success

Success means different things at different stages of your AI agent deployment. Here's how to think about it:

Timeframe	Stage	Success looks like	Key metric
Week 1–2	Launch	Agent is functional, calls are completing, data is flowing to CRM	Contact rate, error rate, CRM sync rate
Month 1	Optimization	Outcome accuracy is improving, qualification rate is stable	Qualification rate, transfer rate, disposition accuracy
Month 2–3	Impact	Agent-sourced pipeline is measurable and trending up	Pipeline generated, cost per qualified lead
Month 3+	Scale	Agent vs. human lift is proven, coverage is 95%+	Lead coverage rate, agent lift %, CAC payback

The single most important number for most revenue teams: qualified leads per 100 inbound leads. This captures both coverage (are you contacting every lead?) and quality (is the agent qualifying the right ones?). Track it weekly. If it's trending up, your agent is working.
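The reason this number captures both coverage and quality is that it is the product of the two: (contacted ÷ inbound) × (qualified ÷ contacted) × 100. A small sketch with made-up counts and illustrative names:

```python
def qualified_per_100_leads(inbound_leads: int, contacted_leads: int,
                            qualified_leads: int) -> dict:
    """Break qualified leads per 100 inbound leads into coverage and quality."""
    if inbound_leads <= 0 or contacted_leads <= 0:
        raise ValueError("need at least one inbound and one contacted lead")
    coverage = contacted_leads / inbound_leads
    qualification_rate = qualified_leads / contacted_leads
    return {
        "coverage": coverage,
        "qualification_rate": qualification_rate,
        "qualified_per_100": 100 * qualified_leads / inbound_leads,
    }


# Example: 400 inbound leads, 380 contacted, 152 qualified.
weekly = qualified_per_100_leads(400, 380, 152)
```

In this example coverage is 95%, the qualification rate is 40%, and the headline number is 38 qualified leads per 100 inbound leads. When the headline drops, the two components show whether the cause is missed leads or weaker qualification.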

Frequently asked questions

How many calls do I need before I can measure performance?

Treat 50 completed calls as a rough minimum for outcome rates to be directionally useful; at that sample size the margin of error is still wide. For cohort comparisons (agent vs. human), aim for 200+ calls per cohort. Below 50 calls, individual call quality matters more than aggregate metrics, so focus on transcript review.
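To see how much uncertainty sits behind a rate at a given call count, you can compute a confidence interval. The sketch below uses the Wilson score interval, a standard method for proportions; it is a general statistical tool and not part of Thoughtly, and the function name is illustrative.

```python
from math import sqrt


def wilson_interval(successes: int, total: int, z: float = 1.96):
    """Return the (low, high) Wilson score interval; z=1.96 is roughly 95%."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z ** 2 / total
    center = (p + z ** 2 / (2 * total)) / denom
    half_width = z * sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return center - half_width, center + half_width


# Same observed 40% qualification rate at two sample sizes.
low_50, high_50 = wilson_interval(20, 50)
low_200, high_200 = wilson_interval(80, 200)
```

With 20 qualified out of 50 calls, the interval runs from roughly 28% to 54%; with 80 out of 200 it narrows to roughly 33% to 47%. That is why 50 calls is enough to spot large problems but not to compare two cohorts that differ by a few points.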

What's a good qualification rate for AI voice agents?

It depends heavily on your vertical and lead source. For inbound speed-to-lead in insurance or mortgage, 30–50% qualification rates are common. For re-engagement of aged leads, 10–20% is typical. For appointment setting with warm leads, 40–60% is achievable. Benchmark against your human team's performance — if the agent's qualification rate is within 10% of your human average, you're in good shape.

How do I measure conversation quality without listening to every call?

Use Thoughtly's conversation quality scoring, which evaluates every call on intent, sentiment, and objection type. Filter History by low-sentiment calls or specific objection types to find the 5–10 calls worth reviewing manually each week. Dispositions also help — if a call was tagged "Request callback", that's a signal the conversation didn't resolve cleanly.

Should I track cost per call or cost per qualified lead?

Cost per qualified lead (CPQL) is the better metric. Cost per call rewards the agent for making more calls, not for producing results. CPQL = total Thoughtly spend ÷ qualified lead count. If you spend $500 on the agent in a week and get 25 qualified leads, your CPQL is $20. Compare that to your CAC benchmark — if it's lower, the agent is adding value.

Can I track ROI without a CRM?

You can track operational metrics (contact rate, qualification rate, booking rate) using Thoughtly History and Analytics alone. But to track pipeline and revenue impact, you need a system of record for opportunities — whether that's HubSpot, Salesforce, or even a Google Sheet. Without a CRM, you can measure activity but not impact.

How often should I update my agent's prompt based on measurement?

Review transcripts weekly, but don't change the prompt more than once every 2–3 weeks. Frequent prompt changes make it hard to isolate what's working. When you do change the prompt, keep a version log and compare outcome rates before and after. If a change drops your qualification rate by more than 5 percentage points, revert.
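The revert rule can be written down as a small check against your version log. This sketch applies the 5 percentage point threshold from the paragraph above; the function name and the counts in the example are illustrative.

```python
def should_revert(before_qualified: int, before_calls: int,
                  after_qualified: int, after_calls: int,
                  max_drop_points: float = 5.0):
    """Return (revert?, drop in percentage points) for a prompt change."""
    if before_calls <= 0 or after_calls <= 0:
        raise ValueError("both periods need at least one call")
    before_rate = 100 * before_qualified / before_calls
    after_rate = 100 * after_qualified / after_calls
    drop = before_rate - after_rate
    return drop > max_drop_points, drop


# Example: qualification rate went from 45% to 38% after a prompt change.
revert, drop = should_revert(45, 100, 38, 100)
```

Here the rate fell by 7 points, so the rule says revert. With small call counts a swing of this size can happen by chance, so compare periods with similar volume and lead mix before acting on it.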
