OpenMid Evaluator

Why We Built This

The x402 protocol enables AI agents to discover and pay for API access autonomously. But discovery alone isn't enough—agents need to know which APIs actually work, return useful data, and are worth paying for.

We built the OpenMid Evaluator to solve three core problems:

  • Trust Gap — With 12,000+ x402 endpoints indexed, there's no way to manually verify which ones work. Agents need automated quality signals.
  • Wasted Payments — Agents paying for broken or useless APIs waste money and degrade user experience. Quality metrics prevent this.
  • Agent-Specific Needs — Traditional API monitoring checks "does it return 200?" but agents need to know "can I use this response to complete my task?"
The Evaluator makes real x402 payments to test endpoints, measuring not just availability but actual usefulness for AI agents.

Evaluation Criteria

Before an endpoint is evaluated, it must pass through our eligibility pipeline. This ensures we only spend resources testing endpoints that are likely to be useful.

Eligibility Requirements

  • Network — Base mainnet only (no testnets)
  • Recent Activity — Must have on-chain transactions in the past 7 days
  • Description Quality — Must include an action verb and a specific service description
  • URL Legitimacy — No deployment platforms, random UUIDs, or suspicious domains
  • Price — ≤ $0.05 USDC per call
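A minimal sketch of how such an eligibility filter might look. The `Endpoint` shape, the verb list, and the suspicious-host list are illustrative assumptions, not the production schema or rules:

```typescript
// Illustrative endpoint shape -- field names are assumptions, not the real schema.
interface Endpoint {
  network: string;        // e.g. "base" for Base mainnet
  lastTxDaysAgo: number;  // days since the most recent on-chain transaction
  description: string;
  url: string;
  priceUsdc: number;      // price per call in USDC
}

// Assumed heuristics for "action verb" and "suspicious domain" checks.
const ACTION_VERBS = ["get", "fetch", "search", "generate", "convert", "analyze"];
const SUSPICIOUS_HOSTS = [".vercel.app", ".netlify.app", ".ngrok.io"];

function isEligible(e: Endpoint): boolean {
  const host = new URL(e.url).hostname.toLowerCase();
  const desc = e.description.toLowerCase();
  return (
    e.network === "base" &&                            // Base mainnet only
    e.lastTxDaysAgo <= 7 &&                            // active in the past 7 days
    ACTION_VERBS.some((v) => desc.includes(v)) &&      // has an action verb
    e.description.trim().length >= 20 &&               // specific, non-trivial description
    !SUSPICIOUS_HOSTS.some((s) => host.endsWith(s)) && // no deployment platforms
    e.priceUsdc <= 0.05                                // ≤ $0.05 USDC per call
  );
}
```

An endpoint failing any single check is skipped, so the cheapest checks can run first.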

What We Test

For each eligible endpoint, the Evaluator:

  1. Generates a realistic request payload using the endpoint's input schema
  2. Makes a real x402 payment and captures the response
  3. Measures response time, status code, and content type
  4. Evaluates whether the response is useful for the intended task
  5. Tracks consecutive failures to detect degraded endpoints
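The five steps above can be sketched as a single evaluation run. `x402Fetch`, `generatePayload`, and `scoreUsefulness` are hypothetical helpers standing in for the real payment client, payload generator, and usefulness judge:

```typescript
// Hypothetical result of a single evaluation run.
interface RunResult {
  status: number;
  latencyMs: number;
  contentType: string;
  usefulness: number; // 0-5, judged against the intended task
}

// Stand-in for a fetch() wrapper that handles the x402 payment flow.
type X402Fetch = (url: string, body: unknown) => Promise<Response>;

async function evaluateEndpoint(
  url: string,
  schema: object,
  x402Fetch: X402Fetch,
  generatePayload: (schema: object) => unknown, // step 1: realistic input from schema
  scoreUsefulness: (body: string) => number,    // step 4: 0-5 usefulness judgment
): Promise<RunResult> {
  const payload = generatePayload(schema);
  const start = Date.now();
  const res = await x402Fetch(url, payload);    // step 2: real paid request
  const body = await res.text();
  return {
    status: res.status,                         // step 3: status, latency, content type
    latencyMs: Date.now() - start,
    contentType: res.headers.get("content-type") ?? "unknown",
    usefulness: res.ok ? scoreUsefulness(body) : 0,
  };
}
```

Step 5 (consecutive-failure tracking) would live outside this function, accumulating `RunResult`s per endpoint over time.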

Metrics We Track

We track five core metrics that together form a comprehensive picture of API quality from an AI agent's perspective.

1. Success Rate (35% weight)

What it measures: The percentage of evaluation runs that returned a successful HTTP response (2xx status code).

Why it matters: The most fundamental signal—if an API doesn't respond successfully, nothing else matters. Low success rates indicate unreliable infrastructure or broken endpoints.

2. Agent Input Friendliness (20% weight)

What it measures: Whether auto-generated request inputs (based on the API's schema) produce useful outputs.

Why it matters: AI agents generate requests programmatically. If an API requires very specific inputs that are hard to infer from the schema, agents will struggle to use it effectively.

Unfriendly signals: 404 responses, empty bodies ({} or []), "not found" messages, client errors (4xx)
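The signals above could be detected with a check along these lines (a sketch; the exact heuristics are assumptions, not the documented rules):

```typescript
// Returns true if a response looks "agent-unfriendly" per the listed signals.
// Heuristics are illustrative assumptions, not the production rules.
function isUnfriendly(status: number, body: string): boolean {
  const trimmed = body.trim();
  if (status >= 400 && status < 500) return true; // client errors, incl. 404
  if (trimmed === "" || trimmed === "{}" || trimmed === "[]") return true; // empty bodies
  if (/\bnot found\b/i.test(trimmed)) return true; // "not found" messages
  return false;
}
```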
3. Output Usefulness (30% weight)

What it measures: How helpful the API response is for completing the intended task. Scored 0-5.

Why it matters: An API can return 200 OK but still be useless. This metric answers the real question: "Can an agent use this response to accomplish what the user asked for?"

  • 5: Fully helpful — directly answers the request
  • 4: Mostly helpful — minor missing fields
  • 3: Partially helpful — incomplete, needs more
  • 2: Barely helpful — only metadata or partial info
  • 1: Not helpful — empty, wrong type, fails task
4. Response Time (10% weight)

What it measures: End-to-end latency from request initiation to response completion, in milliseconds.

Why it matters: Agents often chain multiple API calls. Slow endpoints create poor user experiences and can cause timeouts in agentic workflows.

5. Stability (5% weight)

What it measures: Consistency of successful responses over time, derived from the success rate.

Why it matters: A stable endpoint is one that consistently works. This metric penalizes endpoints with erratic behavior even if their average success rate appears acceptable.
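Taken together, the five weights (35 + 20 + 30 + 10 + 5 = 100%) suggest a composite score along these lines. How each raw metric is normalized to 0-1 (in particular the latency curve) is an assumption here, not the documented formula:

```typescript
interface Metrics {
  successRate: number;       // 0-1
  inputFriendliness: number; // 0-1
  usefulness: number;        // 0-5 rubric score
  avgLatencyMs: number;      // end-to-end latency in milliseconds
  stability: number;         // 0-1, derived from success-rate consistency
}

// Assumed latency normalization: full marks at <= 500 ms, zero at >= 10 s,
// linear in between. The real curve may differ.
function latencyScore(ms: number): number {
  const clamped = Math.min(Math.max(ms, 500), 10_000);
  return 1 - (clamped - 500) / 9_500;
}

function compositeScore(m: Metrics): number {
  return (
    0.35 * m.successRate +
    0.20 * m.inputFriendliness +
    0.30 * (m.usefulness / 5) + // normalize the 0-5 rubric to 0-1
    0.10 * latencyScore(m.avgLatencyMs) +
    0.05 * m.stability
  );
}
```

A perfect endpoint (always succeeds, friendly inputs, usefulness 5, fast, stable) scores 1.0; each weight caps how much one metric can move the total.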

Questions?

  • View the Dashboard — see live evaluation results for x402 endpoints
  • Join our Telegram — ask questions or report issues