Most companies I speak to are not using the right metrics to measure the success of their GEO work.
Mention rate and citation rate are the most commonly used, but they don't answer the core question: "How are we showing up in AI chats?"
73% of B2B buyers use AI tools in purchasing research according to Averi’s March 2026 research. Being mentioned means you're in the game, but it doesn't tell you if you're winning.
The solution is simple in theory but tricky in practice. Capture and store AI responses to specific prompts: the actual text, the citations, the model, and the timestamp. Then analyze them.
In this post we’ll cover how and why we do this for our clients.
Mention rate is a diagnostic, not a KPI
Every vendor I’ve seen in the GEO measurement space — Profound, Otterly, HubSpot AEO, SE Ranking — sells roughly the same metric stack. Defined prompts. Mention rate. Citation rate. Share of voice. That may have changed by the time you read this, but that’s what I’ve seen to date.
Those numbers are useful. They tell you how often you're in the conversation. They give a dashboard something easy and digestible to trend. We use them too.
What mention rate doesn't tell you is how your company or products are represented to buyers. It doesn't tell you whether you were recommended, your product features were listed, or which competitors were also mentioned.
That contextual information is what really answers the core question: "How are we showing up in AI?" It's also what's needed to inform what your team should produce and publish next.
Increasing AI mention rate by 10% could be a measure of success. I'd argue a better measure of success is a 10% increase in how often AI recommends your company on prompts your buyers are likely to use.
💬 Being mentioned means you're in the game. It doesn't tell you if you're winning.
A three-step workflow
We can break the workflow into three steps: (1) defining prompts to track, (2) storing AI responses to those prompts, and (3) analyzing those responses.
The first step is where real strategic input is required.
Using AI to spit out a list of prompts to track is a shortcut that undermines the whole workflow. AI-drafted prompts are biased; they are unlikely to be written in ways people actually type and speak (too structured and verbose), and they are unlikely to be written with sufficient context.
A human strategist needs to spend time curating prompts. Ideally, buyer personas are documented and used; internal stakeholders provide input; and tracked prompts have logical groupings that carry through to analysis.
💡 Write success criteria when you decide to track a prompt. "What does winning this response look like?" is the cheapest question to answer when the prompt is fresh — and if you can't articulate it, the prompt isn't ready to track.
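To make that concrete, here is a minimal sketch of what a tracked prompt record could look like. The field names, the example prompts, and the `is_ready` check are all illustrative assumptions, not a fixed schema; the point is only that category, persona, and success criteria travel with the prompt from day one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrackedPrompt:
    """One curated prompt. All field names here are hypothetical."""
    prompt_id: str
    text: str              # written the way a buyer would actually type it
    category: str          # logical grouping that carries through to analysis
    persona: str           # which documented buyer persona this represents
    success_criteria: str  # "what does winning this response look like?"

    def is_ready(self) -> bool:
        # A prompt without articulated success criteria isn't ready to track.
        return bool(self.success_criteria.strip())


prompts = [
    TrackedPrompt(
        prompt_id="p001",
        text="best payroll software for a 50 person startup",
        category="vendor-shortlist",
        persona="finance-lead",
        success_criteria="We are named among the top three recommended vendors.",
    ),
    TrackedPrompt(
        prompt_id="p002",
        text="is payroll software worth it for small teams",
        category="category-education",
        persona="founder",
        success_criteria="",  # not articulated yet, so it is held back
    ),
]

ready = [p for p in prompts if p.is_ready()]
```

Only `p001` makes it into `ready`; `p002` stays out of the tracked list until someone can say what winning looks like.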
The easy part - storing AI responses
Once a ‘north star’ list of tracked prompts is established, an agent can be spun up that sends them to top frontier models and logs response and citation text.
I usually use DataForSEO's LLM APIs for this step and pay a fraction of what seats in vendor tools cost. The LLM Responses endpoint returns the full text of an AI answer plus its citations, on demand, across ChatGPT, Claude, Gemini, and Perplexity. Pricing is pay-as-you-go, so you scale with prompt volume instead of buying seats.
I pipe the responses into a flat store — one row per prompt-model-day, with full text, citations, and metadata. Where and how we store it differs based on a client's tools, but local txt files work well enough. The exact shape of the store matters less than the discipline of keeping it: every response, every run, never overwrite.
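As a rough illustration of the "never overwrite" discipline, the sketch below appends one JSON line per prompt-model-day to a local file. The JSON Lines layout, the field names, and the `append_response` and `load_responses` helpers are assumptions made for this example; fetching the response itself is left out, and the model names are placeholders.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def append_response(store: Path, prompt_id: str, model: str,
                    response_text: str, citations: list,
                    captured_at: Optional[datetime] = None) -> dict:
    """Append one row to the store. Existing rows are never modified."""
    captured_at = captured_at or datetime.now(timezone.utc)
    row = {
        "prompt_id": prompt_id,
        "model": model,
        "run_date": captured_at.date().isoformat(),
        "captured_at": captured_at.isoformat(),
        "response_text": response_text,
        "citations": citations,
    }
    # Mode "a" only ever adds to the end of the file.
    with store.open("a", encoding="utf-8") as f:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return row


def load_responses(store: Path) -> list:
    """Read every stored row back as a list of dicts."""
    with store.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Demo in a temporary directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "responses.jsonl"
    when = datetime(2026, 3, 1, 9, 0, tzinfo=timezone.utc)
    append_response(store, "p001", "model-a", "Acme is a solid choice.",
                    ["https://www.acme.example/pricing"], captured_at=when)
    append_response(store, "p001", "model-b", "Consider Globex.",
                    [], captured_at=when)
    rows = load_responses(store)
```

Because every run adds rows rather than replacing them, the history needed for trend analysis accumulates without any extra work.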
The hard part - analyzing AI responses
Mention and citation rate are easy to calculate. Measuring “how are we showing up?” is not. AI response text needs to be classified in a structured and meaningful way. This is why most vendors stop at mention rate.
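To show how little work the easy half is, here is a sketch of both rates computed from stored rows. The definitions are assumptions for this example (vendors define these differently): mention rate is the share of responses whose text contains the brand name, and citation rate is the share of responses citing at least one URL on the brand's domain. The brand, domain, and sample rows are made up.

```python
from urllib.parse import urlparse


def mention_rate(rows: list, brand: str) -> float:
    """Share of responses whose text contains the brand (case-insensitive)."""
    if not rows:
        return 0.0
    hits = sum(brand.lower() in r["response_text"].lower() for r in rows)
    return hits / len(rows)


def citation_rate(rows: list, domain: str) -> float:
    """Share of responses citing at least one URL on the given domain."""
    if not rows:
        return 0.0

    def cites_domain(row: dict) -> bool:
        for url in row["citations"]:
            host = urlparse(url).hostname or ""
            if host == domain or host.endswith("." + domain):
                return True
        return False

    return sum(cites_domain(r) for r in rows) / len(rows)


sample_rows = [
    {"response_text": "Acme and Globex are both popular.",
     "citations": ["https://www.acme.example/pricing"]},
    {"response_text": "Many teams choose Globex.",
     "citations": ["https://globex.example/"]},
    {"response_text": "Consider acme for small teams.",
     "citations": []},
    {"response_text": "Top picks: Globex, Initech, Acme.",
     "citations": ["https://reviews.example/best"]},
]

acme_mentions = mention_rate(sample_rows, "Acme")             # 3 of 4 responses
acme_citations = citation_rate(sample_rows, "acme.example")   # 1 of 4 responses
```

Notice what the numbers hide: the fourth response counts as a mention even though Acme is listed last behind two competitors.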
I build custom agents for clients that review response text and append two things: (1) a 'performance score' from 0-100, and (2) binary fields for client-specific tracking dimensions.
When defining the list of prompts I track for a client, I capture notes on ‘what success looks like’ for that client in each response. Examples include being the top vendor mentioned, a major product release being accurately summarized, or the client being described as regulated by a particular entity.
These ‘what success looks like’ notes are used by an agent, in conjunction with its existing business context layer, to score each response from 0 to 100 and write a one-sentence rationale for its score. We can then track the performance score for tracked prompts over time in our KPI dashboards, grouped by prompt category.
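Once scored rows exist, the dashboard roll-up is ordinary aggregation. The sketch below assumes each scored row carries a `category`, a `run_date`, a 0-100 `score`, a rationale, and one example binary field, `recommended`; those names and the sample values are hypothetical, and the scoring agent itself is out of scope here.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical output of a scoring agent: one row per scored response.
scored_rows = [
    {"category": "vendor-shortlist", "run_date": "2026-03-01", "score": 40,
     "recommended": False, "rationale": "Mentioned, but listed after two competitors."},
    {"category": "vendor-shortlist", "run_date": "2026-03-01", "score": 60,
     "recommended": True, "rationale": "Recommended, but pricing was described inaccurately."},
    {"category": "vendor-shortlist", "run_date": "2026-03-08", "score": 80,
     "recommended": True, "rationale": "Named as the top pick with accurate features."},
    {"category": "compliance", "run_date": "2026-03-01", "score": 90,
     "recommended": False, "rationale": "Regulatory status stated correctly."},
]


def summarize(rows: list) -> dict:
    """Mean score and recommendation rate per (category, run_date)."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["category"], row["run_date"])].append(row)
    return {
        key: {
            "mean_score": mean(r["score"] for r in members),
            "recommended_rate": mean(1 if r["recommended"] else 0 for r in members),
            "responses": len(members),
        }
        for key, members in sorted(groups.items())
    }


summary = summarize(scored_rows)
```

Keeping the rationale on every row means a drop in a category's mean score can be traced back to the specific responses that caused it.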
From metrics to action
Prompt performance scores improve your other agents that plan content and site updates. They give those agents a clean signal on gaps. Without them, a gap is just a prompt where you don’t show up. With them, a gap is a prompt that underperforms with your buyers in specific ways.
The time invested building this scoring system doesn’t just pay out in a dashboard. It gives an entire stack of agents structured data and context to improve performance.