
Prompt Eval 101

Evaluate an AI prompt in 30 seconds — classify support tickets with Claude, then use an LLM judge to grade the reasoning.

Based on anthropic.skilljar.com/claude-with-the-anthropic-api →

2026-04-20 · 0:45 · Python · Streamlit · Claude · Evaluation

A tiny end-to-end demo showing how to evaluate a classifier prompt:

  1. Classify customer-support tickets into one of 5 categories using Claude.
  2. Grade the reasoning with a second Claude call acting as judge (1–5 score).
  3. Spot a failure — ticket #8 is intentionally misclassified so you can see what a bad row looks like.
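The two Claude calls in steps 1 and 2 can be sketched as a pair of prompt builders. This is a hypothetical sketch, not the app's actual code: the category names, response format, and helper names are assumptions for illustration.

```python
# Hypothetical category set for the demo; the real app's five categories may differ.
CATEGORIES = ["billing", "bug report", "feature request", "account access", "other"]

def build_classify_prompt(ticket: str) -> str:
    """First call: ask Claude to pick exactly one category and justify it."""
    return (
        "Classify this support ticket into exactly one of these categories: "
        + ", ".join(CATEGORIES)
        + ".\nReply as 'category: <name>' followed by a one-line reason.\n\n"
        + f"Ticket: {ticket}"
    )

def build_judge_prompt(ticket: str, answer: str) -> str:
    """Second call: a judge prompt grading the classifier's reasoning 1-5."""
    return (
        "You are grading a ticket classification.\n"
        f"Ticket: {ticket}\n"
        f"Model answer: {answer}\n"
        "Score the reasoning from 1 to 5, where 5 means the category is "
        "correct and the reasoning is sound. Reply as 'score: <n>'."
    )
```

Each builder's output is sent as a single user message; keeping the judge in a separate call prevents the classifier's context from biasing the grade.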

Mock mode runs offline with pre-computed outputs, so you can play with the UI before plugging in an API key.
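A mock-mode toggle like the one described is commonly keyed off the API key's presence. The sketch below assumes that convention plus a `MOCK_RESPONSES` lookup table and a model name; none of these names are confirmed by the source.

```python
import os

# Hypothetical table of pre-computed outputs used when running offline.
MOCK_RESPONSES: dict[str, str] = {}

def get_completion(prompt: str) -> str:
    """Return a canned response offline; call the live API only if a key is set."""
    if not os.environ.get("ANTHROPIC_API_KEY"):
        return MOCK_RESPONSES.get(prompt, "category: other\nreason: (mock)")
    import anthropic  # imported lazily so mock mode works without the SDK
    client = anthropic.Anthropic()
    msg = client.messages.create(
        model="claude-sonnet-4-5",  # assumed model name; substitute your own
        max_tokens=300,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text
```

Because the key check happens per call, the same Streamlit UI code runs unchanged in both modes.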

Run locally

pip install -r requirements.txt
streamlit run app.py                     # mock mode
ANTHROPIC_API_KEY=sk-ant-… streamlit run app.py   # live mode
