A practical framework for measuring AI voice agent performance — from operational health and conversation quality to pipeline impact and ROI attribution. Covers the metrics, tools, and weekly cadence revenue teams need.
You deployed an AI voice agent. Calls are happening. But are they working? Without a measurement framework, you're left counting call volume and hoping the pipeline shows up. This guide walks through the metrics, tools, and cadence revenue teams need to evaluate AI voice agent performance, from operational health to revenue impact.
Whether you're running inbound speed-to-lead agents, re-engagement campaigns, or appointment-setting workflows, the same principle applies: a completed call is not the same as a successful outcome. You need to measure what matters for your funnel, not just what's easy to count.
On Call Completed trigger configured for post-call data capture
Before you can measure performance, you need to know what "good" looks like. Define the outcomes that matter for your specific funnel. In Thoughtly, outcomes are the branching labels that route conversations after each caller response. At the call level, you also have dispositions that auto-tag calls based on transcript analysis.
For a typical inbound lead conversion funnel (insurance, mortgage, education, healthcare, home services), your outcome taxonomy should include:
The key decision: what counts as a "successful" call? For most lead conversion workflows, booked and qualified are your success outcomes. Everything else is either a process outcome (callback, transfer) or a non-outcome (no answer, voicemail, DNQ). This distinction drives every downstream metric.
Thoughtly gives you three layers of measurement: Analytics for aggregate dashboards, History for per-call detail, and Automations for piping call data into your CRM or reporting tools. Here's how to wire them together.
In Thoughtly Automations, the On Call Completed trigger fires after every call ends. It outputs rich call data: durations, outcomes, variables captured, action flags, transfers, and voicemail detection. This is your primary data pipeline for measurement.
To set it up:
Thoughtly → On Call Completed as the trigger
The trigger payload includes the full transcript (as a structured array with speaker, timestamp, and node information), call duration, outcome, variables, and system metadata. Here's the shape of what you'll receive:
```json
{
  "call_id": "interview_response_id",
  "agent_id": "agent_id",
  "duration": 142,
  "outcome": "qualified",
  "transcript": [
    {
      "transcript": "Hello, how can I help you today?",
      "speaker": "ai",
      "createdAt": "2025-11-03T19:33:18.330Z",
      "step": 1,
      "node_id": "node_abc123"
    },
    {
      "transcript": "I'd like to get a quote",
      "speaker": "user",
      "createdAt": "2025-11-03T19:33:25.120Z"
    }
  ],
  "variables": {
    "phone": "+1234567890",
    "intent": "quote_request",
    "location": "Austin, TX"
  }
}
```
Thoughtly's disposition feature automatically tags calls based on transcript content and call outcome. Labels like "Qualified lead", "No answer", "Left voicemail", or "Request callback" are applied the moment a call ends. These tags surface in History and can be used for filtering and reporting.
Note: dispositions are the legacy post-call setting. Thoughtly recommends migrating to Automations with On Call Completed for more powerful conditional logic, CRM updates, and multi-step workflows. But dispositions still work for basic call categorization in your call history.
Thoughtly variables extract structured data from the conversation (caller intent, location, eligibility, availability) right after the caller replies and before outcomes evaluate. These variables are included in the On Call Completed payload, so you can pipe them into your CRM fields or reporting spreadsheet for measurement.
For example, if your agent captures intent, location, and preferred_time as variables, you can track qualification rates by location or booking rates by intent — not just aggregate call counts.
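A minimal sketch of that kind of breakdown, assuming you have collected On Call Completed payloads somewhere you can read them back as dictionaries. The function name `qualification_rate_by` and the outcome labels other than `qualified` are illustrative assumptions, not part of Thoughtly's API.

```python
from collections import defaultdict

def qualification_rate_by(payloads, variable, success_outcomes=("qualified", "booked")):
    """Group call payloads by one captured variable and return
    {variable value: share of calls that ended in a success outcome}."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for payload in payloads:
        # Calls where the variable was never captured are grouped under "unknown".
        key = payload.get("variables", {}).get(variable, "unknown")
        totals[key] += 1
        if payload.get("outcome") in success_outcomes:
            successes[key] += 1
    return {key: successes[key] / totals[key] for key in totals}

# Hypothetical payloads, trimmed to the two fields this sketch reads.
calls = [
    {"outcome": "qualified", "variables": {"location": "Austin, TX"}},
    {"outcome": "no_answer", "variables": {"location": "Austin, TX"}},
    {"outcome": "booked", "variables": {"location": "Denver, CO"}},
    {"outcome": "dnq", "variables": {}},
]
rates = qualification_rate_by(calls, "location")
print(rates)
```

Swapping `"location"` for `"intent"` gives the same breakdown by intent. Adjust `success_outcomes` to whatever your own taxonomy treats as success.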
Not all metrics are created equal. Here's a framework organized into four tiers — from operational health to revenue impact. Each tier answers a different question.
| Metric | What it measures | Where to find it | Healthy range |
|---|---|---|---|
| Contact rate | % of outbound calls that connect | History (filter by status) | ≥ 35% (varies by vertical) |
| Answer rate (inbound) | % of inbound calls answered by the agent | History / Analytics | ≥ 95% |
| Call duration | Average minutes per call | Analytics → Talk Time | 2–6 min (qualification); 5–12 min (booking) |
| Voicemail detection rate | % of outbound calls that reach voicemail | History (filter: Left Voicemail) | Context-dependent; compare to human dialing |
| Error / failed rate | % of calls with errors or failed connections | History (filter: Failed) | ≤ 3% |
These metrics tell you whether your agent is technically working. If contact rate drops below 20% or error rate spikes above 5%, you have an infrastructure or carrier problem to solve before anything else matters.
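The two alert thresholds above can be checked with a few lines of code once call statuses are exported. This is a sketch under assumptions: the status strings (`connected`, `failed`, `voicemail`) and the function name `operational_health` are hypothetical, so map them to whatever your History export actually contains.

```python
def operational_health(statuses, min_contact_rate=0.20, max_error_rate=0.05):
    """Compute contact rate and error rate from a list of call statuses
    and flag the thresholds described in the text."""
    total = len(statuses)
    if total == 0:
        return {"contact_rate": 0.0, "error_rate": 0.0, "alerts": []}
    contact_rate = sum(s == "connected" for s in statuses) / total
    error_rate = sum(s == "failed" for s in statuses) / total
    alerts = []
    if contact_rate < min_contact_rate:
        alerts.append("contact rate below threshold")
    if error_rate > max_error_rate:
        alerts.append("error rate above threshold")
    return {"contact_rate": contact_rate, "error_rate": error_rate, "alerts": alerts}

# Hypothetical week: 18 connected, 6 failed, 76 voicemail out of 100 calls.
week = ["connected"] * 18 + ["failed"] * 6 + ["voicemail"] * 76
health = operational_health(week)
print(health)
```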
| Metric | What it measures | Where to find it | Healthy range |
|---|---|---|---|
| Outcome accuracy | % of calls where the AI outcome matches human review | Manual review in History | ≥ 85% |
| Transfer rate | % of calls escalated to a human | History (filter: Transferred) | 10–25% (depends on use case) |
| Qualification rate | % of connected calls that result in a qualified lead | History (filter by disposition) | 30–60% (varies by vertical) |
| Booking rate | % of connected calls that result in an appointment | History / CRM cross-reference | 15–35% (depends on funnel) |
| Repeat callback rate | % of leads requiring multiple calls to resolve | CRM or contact history | Lower is better; track trend |
Conversation quality metrics tell you whether the agent is actually doing its job — qualifying, booking, and routing leads effectively. A high transfer rate isn't inherently bad (some workflows are designed to transfer qualified leads), but if it's climbing over 40%, your agent may be struggling to handle objections or complete tasks autonomously.
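Outcome accuracy is the one metric in this tier that needs manual review, so it helps to record each review as a pair and let a script do the counting. A rough sketch, assuming you log `(ai_outcome, human_outcome)` pairs yourself; the function names and outcome labels are illustrative.

```python
from collections import Counter

def outcome_accuracy(reviews):
    """reviews: list of (ai_outcome, human_outcome) pairs from manual review.
    Returns the share of calls where the two labels agree."""
    if not reviews:
        raise ValueError("need at least one reviewed call")
    matches = sum(ai == human for ai, human in reviews)
    return matches / len(reviews)

def disagreements(reviews):
    """Count each (ai_outcome, human_outcome) pair where the labels differ."""
    return Counter((ai, human) for ai, human in reviews if ai != human)

# Hypothetical sample of 20 reviewed calls.
reviews = (
    [("qualified", "qualified")] * 9
    + [("booked", "booked")] * 8
    + [("qualified", "dnq")] * 2
    + [("callback", "qualified")] * 1
)
accuracy = outcome_accuracy(reviews)
print(accuracy)
print(disagreements(reviews).most_common(1))
```

The disagreement counts show which mislabel is most common, which is usually where transcript review should start.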
| Metric | What it measures | Where to find it | Why it matters |
|---|---|---|---|
| Lead coverage rate | % of inbound leads contacted by the agent | CRM lead count vs. Thoughtly call log | Measures whether you're working 100% of leads |
| Speed-to-lead | Minutes between lead arrival and first contact | Thoughtly call log + CRM timestamp | Sub-10-minute response doubles conversion odds |
| Pipeline generated | Dollar value of opportunities created from agent-sourced calls | CRM pipeline report filtered by source | The revenue metric that matters to leadership |
| Cost per qualified lead | Total Thoughtly cost ÷ number of qualified leads | Billing + outcome data | Compare to CAC benchmarks for your vertical |
| Agent vs. human lift | Conversion rate of AI-contacted leads vs. human-only leads | A/B test or cohort comparison | Proves the agent adds incremental value |
Funnel impact metrics are where you prove ROI. The two that matter most: lead coverage (are you working every lead?) and agent vs. human lift (is the agent converting more than your human team would have on its own?). Thoughtly's Analytics dashboard includes agent-vs-human lift measurement so you can quantify exactly how much pipeline the agent added.
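Lead coverage and speed-to-lead both come from joining CRM lead timestamps against the first call per lead. The sketch below assumes you can build two dictionaries keyed by a shared lead ID; the function name and the data layout are hypothetical.

```python
from datetime import datetime
from statistics import median

def coverage_and_speed(lead_arrivals, first_calls):
    """lead_arrivals: {lead_id: datetime the lead arrived in the CRM}
    first_calls:   {lead_id: datetime of the first agent call}
    Returns (coverage rate, median minutes to first contact or None)."""
    if not lead_arrivals:
        return 0.0, None
    contacted = [lead_id for lead_id in lead_arrivals if lead_id in first_calls]
    coverage = len(contacted) / len(lead_arrivals)
    delays = [
        (first_calls[lead_id] - lead_arrivals[lead_id]).total_seconds() / 60
        for lead_id in contacted
    ]
    return coverage, (median(delays) if delays else None)

# Hypothetical morning: four leads arrive, three get a call.
arrivals = {
    "a": datetime(2025, 11, 3, 9, 0),
    "b": datetime(2025, 11, 3, 9, 5),
    "c": datetime(2025, 11, 3, 9, 10),
    "d": datetime(2025, 11, 3, 9, 20),
}
calls = {
    "a": datetime(2025, 11, 3, 9, 2),
    "b": datetime(2025, 11, 3, 9, 9),
    "c": datetime(2025, 11, 3, 9, 40),
}
coverage, median_minutes = coverage_and_speed(arrivals, calls)
print(coverage, median_minutes)
```

The median is used instead of the mean so that one slow outlier (the 30-minute call here) does not hide an otherwise fast response time.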
Beyond quantitative metrics, Thoughtly scores every call on intent, sentiment, and objection type. This conversation quality data is available in the Analytics dashboard and in the call transcript within History. Use it to identify:
Metrics are useless if nobody looks at them. Set up a weekly review that takes 30 minutes and covers operational health, conversation quality, and funnel impact. Here's a proven structure:
On Call Completed automation is firing: check your CRM or webhook logs for recent call records
The Friday digest is especially powerful for RevOps and sales leaders. You can build it with a Thoughtly Automation that triggers On Call Completed, aggregates results, and posts a structured summary to your team channel at end of day.
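If you collect call records outside Thoughtly (for example in a script that receives the webhook), the summary text itself can be assembled along these lines. This is only an illustration of the aggregation step; `build_digest` and the outcome labels are assumptions, and posting to a channel is left out.

```python
from collections import Counter

def build_digest(day, calls, success_outcomes=("qualified", "booked")):
    """Summarize one day's calls as plain text suitable for a team channel."""
    outcomes = Counter(call["outcome"] for call in calls)
    total = len(calls)
    successes = sum(outcomes.get(name, 0) for name in success_outcomes)
    per_100 = round(100 * successes / total, 1) if total else 0.0
    lines = [
        f"Voice agent digest for {day}",
        f"Calls completed: {total}",
        f"Qualified or booked: {successes} ({per_100} per 100 calls)",
    ]
    # One line per outcome, sorted alphabetically for a stable layout.
    for outcome, count in sorted(outcomes.items()):
        lines.append(f"  {outcome}: {count}")
    return "\n".join(lines)

# Hypothetical call records, trimmed to the one field this sketch reads.
calls = [
    {"outcome": "qualified"},
    {"outcome": "booked"},
    {"outcome": "voicemail"},
    {"outcome": "qualified"},
]
digest = build_digest("2025-11-07", calls)
print(digest)
```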
Attribution is the hardest part of measurement — and the most important. Here's a practical approach that works for high-volume lead conversion funnels:
Every call made by a Thoughtly agent should write a call_source or origin attribute to the CRM contact record. Use Thoughtly's CRM integrations (HubSpot, Salesforce, Pipedrive) to automatically write call activity, outcomes, and recordings to the contact. This creates an auditable trail from call to pipeline.
The cleanest way to prove agent lift is a cohort comparison:
If you can't run a controlled A/B test, use a pre/post comparison: measure your team's booking rate and pipeline velocity for 30 days before deploying the agent, then compare to the 30 days after. The delta — adjusted for lead volume changes — is your agent lift.
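One simple way to adjust for lead volume is to compare booking rates rather than raw booking counts. The sketch below does that; the function name and the numbers are made up for illustration, and it does not control for seasonality or changes in lead mix.

```python
def pre_post_lift(pre_leads, pre_bookings, post_leads, post_bookings):
    """Compare booking rates before and after deployment.
    Using rates (bookings / leads) removes the effect of lead volume changes."""
    pre_rate = pre_bookings / pre_leads
    post_rate = post_bookings / post_leads
    return {
        "pre_rate": pre_rate,
        "post_rate": post_rate,
        # Difference in percentage points.
        "lift_points": round((post_rate - pre_rate) * 100, 2),
        # Change relative to the pre-deployment rate.
        "relative_lift": round((post_rate - pre_rate) / pre_rate, 3),
        # Bookings beyond what the old rate would have produced on the new volume.
        "extra_bookings": round(post_bookings - pre_rate * post_leads, 1),
    }

# Hypothetical: 40 bookings from 400 leads before, 75 from 500 leads after.
lift = pre_post_lift(400, 40, 500, 75)
print(lift)
```

In this example the raw booking count rose by 35, but only 25 of those are attributable to the higher rate; the other 10 come from the extra 100 leads.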
Thoughtly's Analytics dashboard includes CAC payback attribution to the channel and campaign level. This means you can see not just "did the agent book meetings" but "which lead sources produced the highest-value agent-sourced pipeline." Use this to reallocate marketing spend toward the channels that produce the most agent-convertible leads.
Thoughtly's built-in Analytics and History cover most day-to-day measurement needs. For deeper analysis — cohort tracking, multi-touch attribution, or executive reporting — export History data and combine it with your CRM pipeline report.
To export from History:
Export in the History page
For automated reporting, pipe Thoughtly call outputs directly into Google Sheets or Smartsheet using Thoughtly Automations. This gives you a live dashboard of call volume, outcomes, and qualified leads, updated by the second, with no manual export needed.
Call volume is the most misleading metric in AI voice. An agent that makes 500 calls but books 2 meetings is underperforming. Always pair volume with outcome rates. The metric that matters is qualified leads per 100 calls — not total calls dialed.
If your outbound contact rate is 15%, your agent isn't broken; your lead list quality or calling windows might be. Before blaming the agent, check: are the numbers valid? Are you calling at the right time of day? Is the caller ID flagged as spam? Use Thoughtly's branded calling feature to improve pickup rates on outbound.
Numbers tell you what happened. Transcripts tell you why. If your qualification rate dropped from 45% to 30% this week, the transcript is where you'll find the answer. Maybe the agent's prompt changed, maybe the lead mix shifted, or maybe a new objection pattern emerged that the agent isn't handling well. Review at least 5 transcripts per week.
AI agent performance doesn't exist in a vacuum. If your agent books 40 meetings but your sales team only closes 2, the issue might be downstream — not with the agent. Track the full funnel: call → qualified → booked → showed up → closed. The agent's job is to produce qualified, booked leads. The sales team's job is to close them. Measure both.
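Stage-to-stage conversion makes it easier to see where the funnel leaks. A small sketch, assuming you can pull a count for each stage from your CRM; the function name and counts are illustrative.

```python
def stage_conversion(funnel):
    """funnel: ordered list of (stage name, count).
    Returns a list of (transition label, conversion rate) for adjacent stages."""
    rows = []
    for (prev_stage, prev_count), (stage, count) in zip(funnel, funnel[1:]):
        rate = count / prev_count if prev_count else 0.0
        rows.append((f"{prev_stage} -> {stage}", rate))
    return rows

# Hypothetical month, matching the 40-booked, 2-closed scenario above.
funnel = [
    ("call", 500),
    ("qualified", 150),
    ("booked", 40),
    ("showed up", 30),
    ("closed", 2),
]
rows = stage_conversion(funnel)
for label, rate in rows:
    print(f"{label}: {rate:.1%}")

# The transition with the lowest rate is the first place to investigate.
weakest = min(rows, key=lambda row: row[1])
print("Weakest step:", weakest[0])
```

Here the weakest step is after the meeting happens, which points at the sales process and not the agent.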
A low transfer rate sounds good: the agent is handling everything! But if your workflow is designed to transfer hot leads to humans (and it should be), a very low transfer rate might mean the agent is trying to do too much. Define what should be transferred (hot leads, complex cases, booking confirmations) and measure against that expectation, not against an arbitrary "lower is better" benchmark.
Success means different things at different stages of your AI agent deployment. Here's how to think about it:
| Timeframe | Stage | Success looks like | Key metrics |
|---|---|---|---|
| Week 1–2 | Launch | Agent is functional, calls are completing, data is flowing to CRM | Contact rate, error rate, CRM sync rate |
| Month 1 | Optimization | Outcome accuracy is improving, qualification rate is stable | Qualification rate, transfer rate, disposition accuracy |
| Month 2–3 | Impact | Agent-sourced pipeline is measurable and trending up | Pipeline generated, cost per qualified lead |
| Month 3+ | Scale | Agent vs. human lift is proven, coverage is 95%+ | Lead coverage rate, agent lift %, CAC payback |
The single most important number for most revenue teams: qualified leads per 100 inbound leads. This captures both coverage (are you contacting every lead?) and quality (is the agent qualifying the right ones?). Track it weekly. If it's trending up, your agent is working.
You need at least 50 completed calls to get statistically meaningful outcome rates. For cohort comparisons (agent vs. human), aim for 200+ calls per cohort. Below 50 calls, individual call quality matters more than aggregate metrics — focus on transcript review.
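To see why sample size matters, it can help to put a confidence interval around an observed rate. The sketch below uses the Wilson score interval, a standard method for proportions; it is offered as one illustration, not as the method behind the thresholds above.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95% confidence)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# The same observed 40% qualification rate at two sample sizes.
low_50, high_50 = wilson_interval(20, 50)
low_200, high_200 = wilson_interval(80, 200)
print(f"50 calls:  {low_50:.3f} to {high_50:.3f}")
print(f"200 calls: {low_200:.3f} to {high_200:.3f}")
```

At 50 calls the plausible range spans roughly 28% to 54%, so week-over-week swings of a few points are mostly noise. At 200 calls the range narrows to roughly 33% to 47%, which is why cohort comparisons need the larger sample.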
It depends heavily on your vertical and lead source. For inbound speed-to-lead in insurance or mortgage, 30–50% qualification rates are common. For re-engagement of aged leads, 10–20% is typical. For appointment setting with warm leads, 40–60% is achievable. Benchmark against your human team's performance: if the agent's qualification rate is within 10% of your human average, you're in good shape.
Use Thoughtly's conversation quality scoring, which evaluates every call on intent, sentiment, and objection type. Filter History by low-sentiment calls or specific objection types to find the 5–10 calls worth reviewing manually each week. Dispositions also help — if a call was tagged "Request callback", that's a signal the conversation didn't resolve cleanly.
Cost per qualified lead (CPQL) is the better metric. Cost per call rewards the agent for making more calls, not for producing results. CPQL = total Thoughtly spend ÷ qualified lead count. If you spend $500 on the agent in a week and get 25 qualified leads, your CPQL is $20. Compare that to your CAC benchmark — if it's lower, the agent is adding value.
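The arithmetic above as a small helper, with cost per call alongside for contrast. The function names and the 400-call figure are illustrative.

```python
def cost_per_qualified_lead(total_spend, qualified_leads):
    """CPQL = total spend / qualified lead count. Returns None when there
    are no qualified leads, since the ratio is undefined."""
    if qualified_leads == 0:
        return None
    return total_spend / qualified_leads

def cost_per_call(total_spend, calls):
    """Shown only for contrast: this falls as call volume rises,
    regardless of whether any of those calls produced a result."""
    return total_spend / calls if calls else None

# The example from the text: $500 in a week, 25 qualified leads.
cpql = cost_per_qualified_lead(500, 25)
print(cpql)

# Hypothetical call volume for the same week.
print(cost_per_call(500, 400))
```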
You can track operational metrics (contact rate, qualification rate, booking rate) using Thoughtly History and Analytics alone. But to track pipeline and revenue impact, you need a system of record for opportunities, whether that's HubSpot, Salesforce, or even a Google Sheet. Without a CRM, you can measure activity but not impact.
Review transcripts weekly, but don't change the prompt more than once every 2–3 weeks. Frequent prompt changes make it hard to isolate what's working. When you do change the prompt, keep a version log and compare outcome rates before and after. If a change drops your qualification rate by more than 5 percentage points, revert.
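The revert rule can be written down as a simple check against your version log. A sketch, assuming you record qualified and total call counts for each prompt version; the function name is hypothetical.

```python
def should_revert(before_qualified, before_calls, after_qualified, after_calls,
                  max_drop_points=5.0):
    """Return True when the qualification rate fell by more than
    max_drop_points percentage points after a prompt change."""
    before_rate = 100 * before_qualified / before_calls
    after_rate = 100 * after_qualified / after_calls
    return (before_rate - after_rate) > max_drop_points

# Hypothetical version log entries.
print(should_revert(45, 100, 38, 100))  # 45% -> 38%, a 7-point drop
print(should_revert(45, 100, 42, 100))  # 45% -> 42%, a 3-point drop
```

The first call returns True and the second returns False. With small samples a drop of this size can be noise, so compare periods with enough calls on each side before acting.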