
The Eval-First Principle

Build your evaluation harness before building your AI system. Without measurement, you're optimizing for vibes—and vibes don't scale.

AI Engineering · Evaluation · Best Practices

The Core Idea

Before building any AI feature, build the system that tells you if it works.

Most AI projects fail not because the technology doesn't work, but because teams can't tell if their changes make things better or worse. They're flying blind, optimizing for "feels right" instead of measurable outcomes.

The Eval-First Principle fixes this: treat your evaluation harness as the first feature, not an afterthought.

Why This Matters

The Vibes Problem

Here's how most AI development goes:

  1. Build a prototype
  2. Test it manually a few times
  3. "Looks good to me" ✓
  4. Ship it
  5. Users complain about edge cases
  6. Fix edge cases (break other things)
  7. Repeat forever

This works when you have 10 users. It breaks catastrophically when you have 10,000.

The Hidden Cost of No Evals

Without evals:

  • Every change requires manual testing (slow)
  • You can't A/B test prompts confidently
  • Regressions slip through unnoticed
  • You can't prove ROI to stakeholders
  • Debugging is "change random things and hope"

With evals:

  • Changes verified automatically in minutes
  • Prompt optimization becomes data-driven
  • Regressions caught before production
  • Clear metrics for stakeholders
  • Debugging follows cause-and-effect

The Framework

Level 1: Unit Evals (Build First)

Test individual components in isolation.

For RAG systems:

def test_retrieval_precision():
    """Does the retriever find the right documents?"""
    query = "What's our vacation policy?"
    results = retriever.search(query, k=5)
    
    # The correct document should be in top 3
    assert "vacation-policy-2024.pdf" in [r.source for r in results[:3]]

For LLM responses:

def test_response_contains_source():
    """Does the response cite its sources?"""
    response = generate_answer("What's the return deadline?")
    
    # Response must include citation
    assert "[Source:" in response or "according to" in response.lower()

Coverage needed: 20-50 test cases covering critical functionality.


Level 2: Integration Evals (Build Second)

Test end-to-end flows with real-world scenarios.

For a customer support bot:

eval_set = [
    {
        "query": "I want to cancel my subscription",
        "expected_intent": "cancellation",
        "must_include": ["cancel", "refund policy"],
        "must_not_include": ["upgrade", "new features"]
    },
    {
        "query": "How do I upgrade to premium?",
        "expected_intent": "upgrade",
        "must_include": ["premium", "pricing"],
        "must_not_include": ["cancel", "refund"]
    }
]

def test_end_to_end():
    for test in eval_set:
        response = chatbot.respond(test["query"])
        
        # Check intent classification
        assert response.intent == test["expected_intent"]
        
        # Check required content
        for phrase in test["must_include"]:
            assert phrase.lower() in response.text.lower()
        
        # Check content that must be absent
        for phrase in test["must_not_include"]:
            assert phrase.lower() not in response.text.lower()

Coverage needed: 100-500 test cases covering user journeys.


Level 3: Human Evals (Build Third)

Systematic human review for subjective quality.

Rating rubric example:

Accuracy (1-5):
1 = Factually incorrect
3 = Mostly correct, minor issues  
5 = Completely accurate

Helpfulness (1-5):
1 = Doesn't address the question
3 = Partially addresses the question
5 = Fully addresses with actionable info

Tone (1-5):
1 = Inappropriate or confusing
3 = Acceptable but generic
5 = Professional and on-brand

Sample size: 50-100 responses per evaluation cycle, rotated through multiple reviewers.
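
To turn rubric scores into something you can track across evaluation cycles, average each dimension over all reviewers. A minimal sketch, assuming ratings are collected as simple dicts; the field names and sample scores below are illustrative, not a prescribed schema:

from collections import defaultdict
from statistics import mean

# Hypothetical rating records from reviewers; field names are illustrative.
ratings = [
    {"response_id": "r1", "reviewer": "alice", "accuracy": 4, "helpfulness": 5, "tone": 4},
    {"response_id": "r1", "reviewer": "bob", "accuracy": 5, "helpfulness": 4, "tone": 4},
    {"response_id": "r2", "reviewer": "alice", "accuracy": 2, "helpfulness": 3, "tone": 5},
]

def summarize(ratings, dimensions=("accuracy", "helpfulness", "tone")):
    """Average each rubric dimension across all collected ratings."""
    scores = defaultdict(list)
    for r in ratings:
        for dim in dimensions:
            scores[dim].append(r[dim])
    return {dim: round(mean(values), 2) for dim, values in scores.items()}

print(summarize(ratings))
# {'accuracy': 3.67, 'helpfulness': 4.0, 'tone': 4.33}

If a per-dimension average moves between cycles, you know which axis of quality your change affected.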


Level 4: Production Evals (Build Last)

Real user feedback from production traffic.

Metrics to track:

  • 👍/👎 explicit feedback rate
  • Conversation abandonment rate
  • Escalation to human rate
  • Time to resolution
  • Repeat query rate (same user, same topic)

Implementation:

from datetime import datetime

# `app` is your web framework's application object (e.g. FastAPI);
# `log_metric` and MODEL_VERSION stand for whatever metrics sink and
# versioning scheme you already use.
@app.post("/feedback")
def log_feedback(conversation_id: str, helpful: bool):
    log_metric({
        "conversation_id": conversation_id,
        "helpful": helpful,
        "timestamp": datetime.now().isoformat(),
        "response_version": MODEL_VERSION
    })
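
From those logged events, the headline metrics fall out of simple aggregation. A minimal sketch, assuming events mirror the payload above; the sample data is made up for illustration:

# Turn logged feedback events into a headline metric.
def helpfulness_rate(events):
    """Share of explicitly rated conversations with positive feedback."""
    rated = [e for e in events if e.get("helpful") is not None]
    if not rated:
        return None
    return sum(1 for e in rated if e["helpful"]) / len(rated)

events = [
    {"conversation_id": "c1", "helpful": True, "response_version": "v3"},
    {"conversation_id": "c2", "helpful": False, "response_version": "v3"},
    {"conversation_id": "c3", "helpful": True, "response_version": "v3"},
]
print(f"Helpful rate: {helpfulness_rate(events):.0%}")  # Helpful rate: 67%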

The Eval-First Development Cycle

  1. Define success criteria → What does "working" mean?
  2. Build eval harness → How will you measure it?
  3. Create baseline → How does the current system perform?
  4. Build feature → Implement the change
  5. Run evals → Does it improve metrics? (see the comparison sketch below)
  6. Ship or iterate → Evidence-based decision
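
Steps 3 and 5 reduce to running the same eval set against the baseline and the candidate, then comparing pass rates. A minimal sketch, assuming each run returns a dict with "passed" and "total" keys, like the SimpleEval harness shown later in this post:

# Compare a baseline eval run against a candidate run (steps 3 and 5).
def compare_runs(baseline, candidate, min_delta=0.0):
    base_rate = baseline["passed"] / baseline["total"]
    cand_rate = candidate["passed"] / candidate["total"]
    delta = cand_rate - base_rate
    return {
        "baseline": base_rate,
        "candidate": cand_rate,
        "delta": delta,
        "verdict": "ship" if delta >= min_delta else "iterate",
    }

print(compare_runs({"passed": 41, "total": 50}, {"passed": 46, "total": 50}))
# baseline 0.82, candidate 0.92 -> verdict "ship"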

Common Objections

"We don't have time to build evals first"

You don't have time not to. Without evals, you'll spend 10x more time debugging, reverting, and apologizing to users.

"AI is too unpredictable to test"

AI outputs are stochastic, but their quality can be measured. Use fuzzy matching, semantic similarity, or LLM-as-judge techniques.
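
For example, fuzzy matching handles paraphrased answers, and an LLM judge can grade free-form responses against a rubric. A minimal sketch of both; call_llm is a placeholder for whatever model client you use, and the prompt and 1-5 scale are illustrative:

import difflib

# Fuzzy matching: stochastic wording can still be scored against a reference.
def fuzzy_match(actual: str, expected: str, threshold: float = 0.8) -> bool:
    ratio = difflib.SequenceMatcher(None, actual.lower(), expected.lower()).ratio()
    return ratio >= threshold

# LLM-as-judge: ask a model to grade the answer. `call_llm` is a placeholder
# for your own client wrapper; it takes a prompt string and returns text.
def llm_judge(question: str, answer: str, call_llm) -> int:
    prompt = (
        "Rate the following answer from 1 (wrong) to 5 (fully correct).\n"
        f"Question: {question}\nAnswer: {answer}\n"
        "Reply with a single digit."
    )
    return int(call_llm(prompt).strip())

assert fuzzy_match(
    "Returns are accepted within 30 days.",
    "Returns are accepted within 30 days of purchase.",
)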

"We'll add evals after we ship"

You won't. Once the pressure of shipping passes, evals become "technical debt" that never gets paid. Build them first.

"Our use case is too subjective"

Even subjective quality can be measured. Use human evaluation with clear rubrics. If humans can't agree on quality, your requirements aren't clear enough.
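
You can test the "can humans agree?" claim directly by measuring inter-rater agreement. A minimal sketch using Cohen's kappa for two reviewers, with made-up scores for illustration:

from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Made-up 1-5 accuracy scores from two reviewers on the same eight responses.
alice = [5, 4, 4, 2, 5, 3, 4, 1]
bob = [5, 4, 3, 2, 5, 3, 4, 2]
print(f"kappa = {cohens_kappa(alice, bob):.2f}")  # kappa = 0.68

A kappa near zero means agreement is barely better than chance; tighten the rubric before trusting the scores.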

Building Your First Eval Harness

Minimum Viable Eval (Start Here)

# eval_harness.py
import json
from datetime import datetime

class SimpleEval:
    def __init__(self, eval_set_path):
        with open(eval_set_path) as f:
            self.eval_set = json.load(f)
    
    def run(self, system_under_test):
        results = []
        for test_case in self.eval_set:
            response = system_under_test(test_case["input"])
            passed = self.check(response, test_case["expected"])
            results.append({
                "input": test_case["input"],
                "expected": test_case["expected"],
                "actual": response,
                "passed": passed
            })
        
        return {
            "timestamp": datetime.now().isoformat(),
            "total": len(results),
            "passed": sum(1 for r in results if r["passed"]),
            "failed": sum(1 for r in results if not r["passed"]),
            "details": results
        }
    
    def check(self, actual, expected):
        # Override this for your use case
        return expected.lower() in actual.lower()

Eval Set Format

[
    {
        "input": "What are your business hours?",
        "expected": "9 AM to 5 PM",
        "category": "faq",
        "priority": "high"
    },
    {
        "input": "Do you ship internationally?",
        "expected": "international shipping",
        "category": "faq",
        "priority": "medium"
    }
]
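
Putting the harness and the eval set together, a usage sketch. The my_chatbot function and the "eval_set.json" filename are stand-ins for your real pipeline and file:

# Wire the harness to a system under test.
def my_chatbot(user_input: str) -> str:
    canned = {
        "What are your business hours?": "We're open 9 AM to 5 PM, Monday to Friday.",
        "Do you ship internationally?": "Yes, we offer international shipping to most countries.",
    }
    return canned.get(user_input, "I'm not sure, let me connect you with a human.")

evaluator = SimpleEval("eval_set.json")
report = evaluator.run(my_chatbot)
print(f"{report['passed']}/{report['total']} passed")

# Fail the CI job if the pass rate drops below a chosen threshold.
assert report["passed"] / report["total"] >= 0.9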

The Payoff

Teams that adopt Eval-First:

  • Ship faster — Automated testing eliminates the manual QA bottleneck
  • Ship safer — Regressions caught before users see them
  • Iterate smarter — Data-driven prompt optimization
  • Scale confidently — Evidence that the system works

Teams that don't:

  • Spend 80% of time debugging production issues
  • Lose user trust through inconsistent behavior
  • Can't explain system performance to stakeholders
  • Eventually rewrite everything from scratch

Conclusion

The Eval-First Principle isn't about having perfect tests—it's about having any systematic measurement from day one.

The question isn't "should we build evals?" It's "can we afford not to?"


The best AI teams I've worked with spend 30% of their time on evaluation infrastructure. The worst spend 0% on evals and 80% fighting fires.

What's your eval strategy?


Abhinav Mahajan

AI Product & Engineering Leader

Building AI systems that work in production. These frameworks come from real experience shipping enterprise AI products.


Find This Framework Useful?

I'd love to hear how you've applied it or discuss related ideas. Let's explore how these principles apply to your specific context.