
The Eval-First Principle

Build your evaluation harness before building your AI system. Without measurement, you're optimizing for vibes—and vibes don't scale.

AI Engineering · Evaluation · Best Practices

The Core Idea

Before building any AI feature, build the system that tells you if it works.

Most AI projects fail not because the technology doesn't work, but because teams can't tell if their changes make things better or worse. They're flying blind, optimizing for "feels right" instead of measurable outcomes.

The Eval-First Principle fixes this: treat your evaluation harness as the first feature, not an afterthought.

Why This Matters

The Vibes Problem

Here's how most AI development goes:

  1. Build a prototype
  2. Test it manually a few times
  3. "Looks good to me" ✓
  4. Ship it
  5. Users complain about edge cases
  6. Fix edge cases (break other things)
  7. Repeat forever

This works when you have 10 users. It breaks catastrophically when you have 10,000.

The Hidden Cost of No Evals

Without evals:

  • Every change requires manual testing (slow)
  • You can't A/B test prompts confidently
  • Regressions slip through unnoticed
  • You can't prove ROI to stakeholders
  • Debugging is "change random things and hope"

With evals:

  • Changes verified automatically in minutes
  • Prompt optimization becomes data-driven
  • Regressions caught before production
  • Clear metrics for stakeholders
  • Debugging follows cause-and-effect

The Framework

Level 1: Unit Evals (Build First)

Test individual components in isolation.

For RAG systems:

def test_retrieval_precision():
    """Does the retriever find the right documents?"""
    query = "What's our vacation policy?"
    results = retriever.search(query, k=5)
    
    # The correct document should be in top 3
    assert "vacation-policy-2024.pdf" in [r.source for r in results[:3]]

For LLM responses:

def test_response_contains_source():
    """Does the response cite its sources?"""
    response = generate_answer("What's the return deadline?")
    
    # Response must include citation
    assert "[Source:" in response or "according to" in response.lower()

Coverage needed: 20-50 test cases covering critical functionality.


Level 2: Integration Evals (Build Second)

Test end-to-end flows with real-world scenarios.

For a customer support bot:

eval_set = [
    {
        "query": "I want to cancel my subscription",
        "expected_intent": "cancellation",
        "must_include": ["cancel", "refund policy"],
        "must_not_include": ["upgrade", "new features"]
    },
    {
        "query": "How do I upgrade to premium?",
        "expected_intent": "upgrade",
        "must_include": ["premium", "pricing"],
        "must_not_include": ["cancel", "refund"]
    }
]

def test_end_to_end():
    for test in eval_set:
        response = chatbot.respond(test["query"])
        
        # Check intent classification
        assert response.intent == test["expected_intent"]
        
        # Check required content
        for phrase in test["must_include"]:
            assert phrase.lower() in response.text.lower()
        
        # Check content that must be absent
        for phrase in test["must_not_include"]:
            assert phrase.lower() not in response.text.lower()

Coverage needed: 100-500 test cases covering user journeys.


Level 3: Human Evals (Build Third)

Systematic human review for subjective quality.

Rating rubric example:

Accuracy (1-5):
1 = Factually incorrect
3 = Mostly correct, minor issues  
5 = Completely accurate

Helpfulness (1-5):
1 = Doesn't address the question
3 = Partially addresses the question
5 = Fully addresses with actionable info

Tone (1-5):
1 = Inappropriate or confusing
3 = Acceptable but generic
5 = Professional and on-brand

Sample size: 50-100 responses per evaluation cycle, rotated through multiple reviewers.
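
To turn rubric scores into something you can track across evaluation cycles, average each dimension over all reviewers. A minimal sketch, assuming ratings are collected as simple dicts; the field names and sample scores below are illustrative, not a prescribed schema:

from collections import defaultdict
from statistics import mean

# Hypothetical rating records from reviewers; field names are illustrative.
ratings = [
    {"response_id": "r1", "reviewer": "alice", "accuracy": 4, "helpfulness": 5, "tone": 4},
    {"response_id": "r1", "reviewer": "bob", "accuracy": 5, "helpfulness": 4, "tone": 4},
    {"response_id": "r2", "reviewer": "alice", "accuracy": 2, "helpfulness": 3, "tone": 5},
]

def summarize(ratings, dimensions=("accuracy", "helpfulness", "tone")):
    """Average each rubric dimension across all collected ratings."""
    scores = defaultdict(list)
    for r in ratings:
        for dim in dimensions:
            scores[dim].append(r[dim])
    return {dim: round(mean(values), 2) for dim, values in scores.items()}

print(summarize(ratings))
# {'accuracy': 3.67, 'helpfulness': 4.0, 'tone': 4.33}

If a per-dimension average moves between cycles, you know which axis of quality your change affected.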


Level 4: Production Evals (Build Last)

Real user feedback from production traffic.

Metrics to track:

  • 👍/👎 explicit feedback rate
  • Conversation abandonment rate
  • Escalation to human rate
  • Time to resolution
  • Repeat query rate (same user, same topic)

Implementation:

from datetime import datetime

# `app` is your web framework's application object (e.g. FastAPI);
# `log_metric` and MODEL_VERSION stand for whatever metrics sink and
# versioning scheme you already use.
@app.post("/feedback")
def log_feedback(conversation_id: str, helpful: bool):
    log_metric({
        "conversation_id": conversation_id,
        "helpful": helpful,
        "timestamp": datetime.now().isoformat(),
        "response_version": MODEL_VERSION
    })
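
From those logged events, the headline metrics fall out of simple aggregation. A minimal sketch, assuming events mirror the payload above; the sample data is made up for illustration:

# Turn logged feedback events into a headline metric.
def helpfulness_rate(events):
    """Share of explicitly rated conversations with positive feedback."""
    rated = [e for e in events if e.get("helpful") is not None]
    if not rated:
        return None
    return sum(1 for e in rated if e["helpful"]) / len(rated)

events = [
    {"conversation_id": "c1", "helpful": True, "response_version": "v3"},
    {"conversation_id": "c2", "helpful": False, "response_version": "v3"},
    {"conversation_id": "c3", "helpful": True, "response_version": "v3"},
]
print(f"Helpful rate: {helpfulness_rate(events):.0%}")  # Helpful rate: 67%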

The Eval-First Development Cycle

  1. Define success criteria → What does "working" mean?
  2. Build eval harness → How will you measure it?
  3. Create baseline → How does the current system perform?
  4. Build feature → Implement the change
  5. Run evals → Does it improve metrics? (see the comparison sketch below)
  6. Ship or iterate → Evidence-based decision
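
Steps 3 and 5 reduce to running the same eval set against the baseline and the candidate, then comparing pass rates. A minimal sketch, assuming each run returns a dict with "passed" and "total" keys, like the SimpleEval harness shown later in this post:

# Compare a baseline eval run against a candidate run (steps 3 and 5).
def compare_runs(baseline, candidate, min_delta=0.0):
    base_rate = baseline["passed"] / baseline["total"]
    cand_rate = candidate["passed"] / candidate["total"]
    delta = cand_rate - base_rate
    return {
        "baseline": base_rate,
        "candidate": cand_rate,
        "delta": delta,
        "verdict": "ship" if delta >= min_delta else "iterate",
    }

print(compare_runs({"passed": 41, "total": 50}, {"passed": 46, "total": 50}))
# baseline 0.82, candidate 0.92 -> verdict "ship"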

Common Objections

"We don't have time to build evals first"

You don't have time not to. Without evals, you'll spend 10x more time debugging, reverting, and apologizing to users.

"AI is too unpredictable to test"

AI outputs are stochastic, but their quality can be measured. Use fuzzy matching, semantic similarity, or LLM-as-judge techniques.
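
For example, fuzzy matching handles paraphrased answers, and an LLM judge can grade free-form responses against a rubric. A minimal sketch of both; call_llm is a placeholder for whatever model client you use, and the prompt and 1-5 scale are illustrative:

import difflib

# Fuzzy matching: stochastic wording can still be scored against a reference.
def fuzzy_match(actual: str, expected: str, threshold: float = 0.8) -> bool:
    ratio = difflib.SequenceMatcher(None, actual.lower(), expected.lower()).ratio()
    return ratio >= threshold

# LLM-as-judge: ask a model to grade the answer. `call_llm` is a placeholder
# for your own client wrapper; it takes a prompt string and returns text.
def llm_judge(question: str, answer: str, call_llm) -> int:
    prompt = (
        "Rate the following answer from 1 (wrong) to 5 (fully correct).\n"
        f"Question: {question}\nAnswer: {answer}\n"
        "Reply with a single digit."
    )
    return int(call_llm(prompt).strip())

assert fuzzy_match(
    "Returns are accepted within 30 days.",
    "Returns are accepted within 30 days of purchase.",
)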

"We'll add evals after we ship"

You won't. Once the pressure of shipping passes, evals become "technical debt" that never gets paid. Build them first.

"Our use case is too subjective"

Even subjective quality can be measured. Use human evaluation with clear rubrics. If humans can't agree on quality, your requirements aren't clear enough.
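
You can test the "can humans agree?" claim directly by measuring inter-rater agreement. A minimal sketch using Cohen's kappa for two reviewers, with made-up scores for illustration:

from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Made-up 1-5 accuracy scores from two reviewers on the same eight responses.
alice = [5, 4, 4, 2, 5, 3, 4, 1]
bob = [5, 4, 3, 2, 5, 3, 4, 2]
print(f"kappa = {cohens_kappa(alice, bob):.2f}")  # kappa = 0.68

A kappa near zero means agreement is barely better than chance; tighten the rubric before trusting the scores.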

Building Your First Eval Harness

Minimum Viable Eval (Start Here)

# eval_harness.py
import json
from datetime import datetime

class SimpleEval:
    def __init__(self, eval_set_path):
        with open(eval_set_path) as f:
            self.eval_set = json.load(f)
    
    def run(self, system_under_test):
        results = []
        for test_case in self.eval_set:
            response = system_under_test(test_case["input"])
            passed = self.check(response, test_case["expected"])
            results.append({
                "input": test_case["input"],
                "expected": test_case["expected"],
                "actual": response,
                "passed": passed
            })
        
        return {
            "timestamp": datetime.now().isoformat(),
            "total": len(results),
            "passed": sum(1 for r in results if r["passed"]),
            "failed": sum(1 for r in results if not r["passed"]),
            "details": results
        }
    
    def check(self, actual, expected):
        # Override this for your use case
        return expected.lower() in actual.lower()

Eval Set Format

[
    {
        "input": "What are your business hours?",
        "expected": "9 AM to 5 PM",
        "category": "faq",
        "priority": "high"
    },
    {
        "input": "Do you ship internationally?",
        "expected": "international shipping",
        "category": "faq",
        "priority": "medium"
    }
]
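
Putting the harness and the eval set together, a usage sketch. The my_chatbot function and the "eval_set.json" filename are stand-ins for your real pipeline and file:

# Wire the harness to a system under test.
def my_chatbot(user_input: str) -> str:
    canned = {
        "What are your business hours?": "We're open 9 AM to 5 PM, Monday to Friday.",
        "Do you ship internationally?": "Yes, we offer international shipping to most countries.",
    }
    return canned.get(user_input, "I'm not sure, let me connect you with a human.")

evaluator = SimpleEval("eval_set.json")
report = evaluator.run(my_chatbot)
print(f"{report['passed']}/{report['total']} passed")

# Fail the CI job if the pass rate drops below a chosen threshold.
assert report["passed"] / report["total"] >= 0.9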

The Payoff

Teams that adopt Eval-First:

  • Ship faster — Automated testing eliminates the manual QA bottleneck
  • Ship safer — Regressions caught before users see them
  • Iterate smarter — Data-driven prompt optimization
  • Scale confidently — Evidence that the system works

Teams that don't:

  • Spend 80% of time debugging production issues
  • Lose user trust through inconsistent behavior
  • Can't explain system performance to stakeholders
  • Eventually rewrite everything from scratch

Conclusion

The Eval-First Principle isn't about having perfect tests—it's about having any systematic measurement from day one.

The question isn't "should we build evals?" It's "can we afford not to?"


The best AI teams I've worked with spend 30% of their time on evaluation infrastructure. The worst spend 0% on evals and 80% fighting fires.

What's your eval strategy?


Abhinav Mahajan

AI Product & Engineering Leader

Building AI systems that work in production. These frameworks come from real experience shipping enterprise AI products.


Find This Framework Useful?

I'd love to hear how you've applied it or discuss related ideas. Let's explore how these principles apply to your specific context.