# Evaluator

## Why We Built This
The x402 protocol enables AI agents to discover and pay for API access autonomously. But discovery alone isn't enough—agents need to know which APIs actually work, return useful data, and are worth paying for.
We built the OpenMid Evaluator to solve three core problems:
- Trust Gap — With 12,000+ x402 endpoints indexed, there's no way to manually verify which ones work. Agents need automated quality signals.
- Wasted Payments — Agents paying for broken or useless APIs waste money and degrade user experience. Quality metrics prevent this.
- Agent-Specific Needs — Traditional API monitoring checks "does it return 200?" but agents need to know "can I use this response to complete my task?"
## Evaluation Criteria
Before an endpoint is evaluated, it must pass through our eligibility pipeline. This ensures we only spend resources testing endpoints that are likely to be useful.
### Eligibility Requirements
| Criteria | Requirement |
|---|---|
| Network | Base mainnet only (no testnets) |
| Recent Activity | Must have transactions in past 7 days |
| Description Quality | Must have action verb + specific service description |
| URL Legitimacy | No deployment platforms, random UUIDs, or suspicious domains |
| Price | ≤ $0.05 USDC per call |
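As a rough illustration, the rules in the table above can be expressed as a single predicate over an endpoint record. The field names, action-verb list, and blocked-host list below are assumptions made for the sketch, not the real OpenMid schema:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical endpoint record; fields are illustrative, not the real index schema.
@dataclass
class Endpoint:
    network: str
    last_tx_at: datetime
    description: str
    url: str
    price_usdc: float

# Illustrative lists; the production checks are presumably richer.
ACTION_VERBS = {"get", "fetch", "search", "generate", "translate", "analyze"}
BLOCKED_HOSTS = ("vercel.app", "netlify.app", "ngrok.io")

def is_eligible(ep: Endpoint, now: datetime) -> bool:
    """Apply the five eligibility rules from the table above."""
    return (
        ep.network == "base-mainnet"                                # Base mainnet only
        and now - ep.last_tx_at <= timedelta(days=7)                # activity in past 7 days
        and any(v in ep.description.lower() for v in ACTION_VERBS)  # action verb present
        and not any(h in ep.url for h in BLOCKED_HOSTS)             # no throwaway deploy hosts
        and ep.price_usdc <= 0.05                                   # at most $0.05 USDC per call
    )
```

An endpoint that fails any single rule is skipped, so no payment is ever spent testing it.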
### What We Test
For each eligible endpoint, the Evaluator:
- Generates a realistic request payload using the endpoint's input schema
- Makes a real x402 payment and captures the response
- Measures response time, status code, and content type
- Evaluates whether the response is useful for the intended task
- Tracks consecutive failures to detect degraded endpoints
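The steps above can be sketched as a single evaluation run. Here `pay_and_call` is a hypothetical stand-in for the real x402 payment plus HTTP request, which this document does not specify:

```python
import time
from dataclasses import dataclass

@dataclass
class EvalResult:
    status: int
    latency_ms: float
    content_type: str
    body: object

@dataclass
class EndpointState:
    consecutive_failures: int = 0

def evaluate(endpoint_url: str, payload: dict, pay_and_call, state: EndpointState) -> EvalResult:
    """One evaluation run: pay, call, measure, and update the failure streak."""
    start = time.perf_counter()
    # pay_and_call performs the x402 payment and request; assumed to
    # return (status_code, content_type, parsed_body).
    status, content_type, body = pay_and_call(endpoint_url, payload)
    latency_ms = (time.perf_counter() - start) * 1000.0
    if 200 <= status < 300:
        state.consecutive_failures = 0
    else:
        state.consecutive_failures += 1  # streak is used to flag degraded endpoints
    return EvalResult(status, latency_ms, content_type, body)
```

Tracking the failure streak per endpoint, rather than only an average, lets the Evaluator notice an endpoint that was healthy last week but is broken today.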
## Metrics We Track
We track five core metrics that together form a comprehensive picture of API quality from an AI agent's perspective.
### Success Rate

Weight: 35%

What it measures: The percentage of evaluation runs that returned a successful HTTP response (a 2xx status code).
Why it matters: The most fundamental signal—if an API doesn't respond successfully, nothing else matters. Low success rates indicate unreliable infrastructure or broken endpoints.
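A minimal sketch of this computation over a window of recorded runs (the function name is ours, not OpenMid's):

```python
def success_rate(status_codes: list[int]) -> float:
    """Fraction of runs that returned a 2xx status; worth 35% of the composite score."""
    if not status_codes:
        return 0.0
    ok = sum(1 for s in status_codes if 200 <= s < 300)
    return ok / len(status_codes)
```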
### Agent Input Friendliness

Weight: 20%

What it measures: Whether auto-generated request inputs (based on the API's schema) produce useful outputs.
Why it matters: AI agents generate requests programmatically. If an API requires very specific inputs that are hard to infer from the schema, agents will struggle to use it effectively.
Signals that an auto-generated input failed include empty payloads (`{}` or `[]`), "not found" messages, and client errors (4xx).

### Output Usefulness

Weight: 30%

What it measures: How helpful the API response is for completing the intended task, scored 0-5.
Why it matters: An API can return 200 OK but still be useless. This metric answers the real question: "Can an agent use this response to accomplish what the user asked for?"
| Score | Meaning |
|---|---|
| 5 | Fully helpful — directly answers the request |
| 4 | Mostly helpful — minor missing fields |
| 3 | Partially helpful — incomplete, needs more |
| 2 | Barely helpful — only metadata or partial info |
| 1 | Not helpful — empty, wrong type, fails task |
### Response Time

Weight: 10%

What it measures: End-to-end latency from request initiation to response completion, in milliseconds.
Why it matters: Agents often chain multiple API calls. Slow endpoints create poor user experiences and can cause timeouts in agentic workflows.
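End-to-end latency can be captured with a monotonic clock around the whole call, as in this small helper (a sketch; the helper name is ours):

```python
import time

def measure_ms(call):
    """Run a zero-argument callable and return (result, wall-clock latency in ms)."""
    start = time.perf_counter()
    result = call()
    return result, (time.perf_counter() - start) * 1000.0
```

`time.perf_counter()` is monotonic, so the measurement is unaffected by system clock adjustments mid-call.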
### Stability

Weight: 5%

What it measures: Consistency of successful responses over time, derived from the success rate.
Why it matters: A stable endpoint works consistently over time. This metric penalizes erratic behavior even when an endpoint's average success rate looks acceptable.
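The five weights above sum to 100%. Assuming each sub-score is normalized to the range [0, 1] (an assumption on our part; the document does not specify the normalization), the composite score could be combined like this:

```python
# Weights taken from the metric sections above; sub-scores are assumed
# to be normalized to [0, 1] before weighting.
WEIGHTS = {
    "success_rate": 0.35,
    "input_friendliness": 0.20,
    "output_usefulness": 0.30,
    "response_time": 0.10,
    "stability": 0.05,
}

def composite_score(scores: dict[str, float]) -> float:
    """Weighted sum of the five normalized metric scores."""
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)
```

Because success rate (35%) and output usefulness (30%) dominate the weighting, an endpoint must both respond reliably and return task-relevant data to score well.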