Prompt Eval 101
Evaluate an AI prompt in 30 seconds — classify support tickets with Claude, then use an LLM judge to grade the reasoning.
Based on the course at anthropic.skilljar.com/claude-with-the-anthropic-api.
A tiny end-to-end demo showing how to evaluate a classifier prompt:
- Classify customer-support tickets into one of 5 categories using Claude.
- Grade the reasoning with a second Claude call acting as judge (1–5 score).
- Spot a failure — ticket #8 is intentionally misclassified so you can see what a bad row looks like.
Mock mode runs offline with pre-computed outputs, so you can play with the UI before plugging in an API key.
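One way mock mode can work is to key off the API-key environment variable and fall back to canned outputs. The function names and the sample output here are assumptions for illustration, not the app's actual code:

```python
import os

# Pre-computed outputs served in mock mode (contents illustrative).
MOCK_OUTPUTS = {
    "My invoice is wrong": "Reasoning: mentions an invoice.\nCategory: Billing",
}

def run_mode() -> str:
    """Mock mode when ANTHROPIC_API_KEY is absent, live otherwise."""
    return "live" if os.environ.get("ANTHROPIC_API_KEY") else "mock"

def classify(ticket: str) -> str:
    """Return a canned answer in mock mode; live mode would call the API."""
    if run_mode() == "mock":
        return MOCK_OUTPUTS.get(ticket, "Category: Other")
    raise NotImplementedError("live mode would call the Anthropic Messages API")
```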
Run locally
pip install -r requirements.txt
streamlit run app.py # mock mode
ANTHROPIC_API_KEY=sk-ant-… streamlit run app.py # live mode