Score AI use cases with weighted multi-criteria evaluation
You build a UseCaseScoringEngine that uses Instructor + Pydantic to extract typed feasibility/impact/data-readiness scores and ranks use cases via configurable weights.
Introduction
When you're running a client discovery workshop, comparing twenty AI use cases by intuition alone leads to missed opportunities and misprioritized investments. Organizations need a repeatable, objective method to score each candidate across dimensions that actually predict production success — not just executive enthusiasm or engineering novelty. By the end of this lesson, you'll be able to build a weighted scoring engine that evaluates each use case across business impact, technical feasibility, data readiness, and time-to-value, producing a ranked shortlist backed by structured LLM outputs your clients can act on immediately.
Key Terminology
- Weighted Scoring Engine: a system that evaluates each AI use case across named dimensions, multiplies each dimension's score by a configured weight, and sums the products into a single weighted_total float that enables objective, comparable ranking across all candidates surfaced in a discovery workshop.
- CriterionScore: a Pydantic model representing one evaluation dimension's output, namely a bounded numeric score (0–10), a confidence value (0–1), and a free-text reasoning string explaining the rating; four CriterionScore entries, one per dimension, are collected inside each UseCaseEvaluation.
- UseCaseEvaluation: the top-level Pydantic schema the scoring engine returns for a single candidate, containing the list of CriterionScore entries, a weighted_total float, a recommendation string, and a risks list. It is the complete structured verdict, ready for immediate ranking.
- ScoringCriteria: a Pydantic model holding the four configurable dimension weights (business_impact_weight, technical_feasibility_weight, data_readiness_weight, time_to_value_weight); the defaults (0.35 / 0.25 / 0.25 / 0.15) favor business impact while treating technical feasibility and data readiness symmetrically.
- Instructor: a Python library that patches a standard OpenAI client via instructor.from_openai(), adding schema enforcement to chat.completions.create() so the LLM's response is validated against a response_model Pydantic class and retried automatically on malformed output, returning a typed Python object instead of raw text.
Concepts
Scoring as an Engineering Discipline
Discovery workshops routinely surface fifteen to thirty AI use case candidates in a single session. Evaluating them by intuition — "this one seems impactful," "that one sounds hard" — produces rankings that reflect whoever speaks loudest rather than what actually predicts production success. A weighted scoring engine converts that subjective exercise into a repeatable calculation: each use case is independently evaluated across a fixed set of dimensions, each dimension yields a bounded score and a confidence value, and a single weighted sum determines the final ranking.
The critical design insight is that the dimensions and their weights encode your beliefs about what makes an AI project succeed in production. Because those weights live in ScoringCriteria as plain Pydantic fields, calibrating them for a specific client — say, downweighting time-to-value for a company with a long investment horizon — requires editing one object, not rewriting evaluation logic.
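As a minimal sketch of that calibration step, the snippet below re-ranks two candidates under the default weights and under a long-horizon variant. The weighted_total and rank helpers, the candidate names, and every score are hypothetical values chosen for illustration; only the ScoringCriteria fields and defaults come from the lesson.

```python
from pydantic import BaseModel


class ScoringCriteria(BaseModel):
    # Same fields and defaults as the lesson's ScoringCriteria model.
    business_impact_weight: float = 0.35
    technical_feasibility_weight: float = 0.25
    data_readiness_weight: float = 0.25
    time_to_value_weight: float = 0.15


def weighted_total(scores: dict[str, float], criteria: ScoringCriteria) -> float:
    """Multiply each dimension score by its weight and sum the products."""
    weights = criteria.model_dump()
    return round(sum(weights[f"{dim}_weight"] * s for dim, s in scores.items()), 2)


def rank(candidates: dict[str, dict[str, float]], criteria: ScoringCriteria) -> list[str]:
    """Return candidate names ordered from highest to lowest weighted total."""
    return sorted(
        candidates,
        key=lambda name: weighted_total(candidates[name], criteria),
        reverse=True,
    )


# Hypothetical scores for two workshop candidates (illustrative only).
candidates = {
    "quick_win": {"business_impact": 5, "technical_feasibility": 8,
                  "data_readiness": 8, "time_to_value": 9},
    "long_bet": {"business_impact": 9, "technical_feasibility": 6,
                 "data_readiness": 6, "time_to_value": 3},
}

default_criteria = ScoringCriteria()
# Long investment horizon: move weight from time-to-value to business impact.
long_horizon = default_criteria.model_copy(
    update={"business_impact_weight": 0.45, "time_to_value_weight": 0.05}
)

default_ranking = rank(candidates, default_criteria)
long_horizon_ranking = rank(candidates, long_horizon)
print(default_ranking)
print(long_horizon_ranking)
```

With these made-up scores the two configurations produce opposite orderings, even though the only change is the one criteria object; the scoring logic is untouched.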
The Four Dimensions and Why the Defaults Are Set as They Are
The engine evaluates every candidate across four orthogonal dimensions. Business impact (weight 0.35) captures delivered value — revenue, cost reduction, competitive advantage — and carries the highest weight because technical elegance without business value never reaches production. Technical feasibility (0.25) checks whether current model capabilities can meet the required accuracy and latency thresholds for the task type. Data readiness (0.25) evaluates whether the necessary data assets — volume, quality, accessibility, labels, regulatory clearance — are actually available; a high-impact idea built on nonexistent labeled data scores near zero here. Time-to-value (0.15) estimates how quickly the project produces measurable results, acting as a tiebreaker that nudges the shortlist toward early wins that build organizational confidence.
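To make the arithmetic concrete, the weighted sum under the default weights can be written out as below. The symbols $s_{\mathrm{BI}}$, $s_{\mathrm{TF}}$, $s_{\mathrm{DR}}$, and $s_{\mathrm{TTV}}$ (the four dimension scores) and the example scores 8, 7, 4, and 6 are assumptions introduced for illustration, not output from the engine.

```latex
\[
\text{weighted\_total}
  = 0.35\, s_{\mathrm{BI}} + 0.25\, s_{\mathrm{TF}}
  + 0.25\, s_{\mathrm{DR}} + 0.15\, s_{\mathrm{TTV}}
\]
\[
0.35(8) + 0.25(7) + 0.25(4) + 0.15(6)
  = 2.80 + 1.75 + 1.00 + 0.90 = 6.45
\]
```

Because the default weights sum to 1 and each score lies in the range 0 to 10, a total computed this way also stays between 0 and 10.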
The default weights treat feasibility and data readiness symmetrically because a common failure mode is to prioritize technically easy problems over data-ready ones, which systematically underestimates data pipeline costs and prep timelines. Keeping them equal forces the scoring engine — and the workshop participants reading its output — to confront both dimensions together.
Pydantic + Instructor as a Validation Gate
Without schema enforcement, an LLM-backed scoring engine is unreliable: the model might express a score as a sentence, omit a dimension entirely, or format confidence as a percentage string instead of a 0–1 float. The solution is to define the expected output as a Pydantic schema — CriterionScore capturing one dimension, UseCaseEvaluation aggregating the full verdict — and use Instructor to make the OpenAI client enforce it.
instructor.from_openai() patches the standard client so that passing response_model=UseCaseEvaluation to chat.completions.create() acts as a post-generation validator: if the model's JSON does not parse into a valid UseCaseEvaluation instance, Instructor retries automatically before surfacing an error. The payoff is a fully typed Python object you can immediately sort alongside other candidates and serialize to a client report — no regex, no string parsing, no defensive null-checks on fields that freeform text might or might not include (see Code Walkthrough).
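The validation half of that gate can be tried without any LLM call. The sketch below feeds hand-written payloads, which are hypothetical stand-ins for model output, through the lesson's CriterionScore schema; the failed_fields helper is an illustrative name, not part of Pydantic or Instructor.

```python
from pydantic import BaseModel, Field, ValidationError


class CriterionScore(BaseModel):
    # Same fields and bounds as the lesson's CriterionScore model.
    criterion: str
    score: float = Field(ge=0, le=10)
    confidence: float = Field(ge=0, le=1)
    reasoning: str


def failed_fields(payload: dict) -> list[str]:
    """Return the names of fields that fail validation (empty list if valid)."""
    try:
        CriterionScore.model_validate(payload)
    except ValidationError as exc:
        return sorted(str(err["loc"][0]) for err in exc.errors())
    return []


# Hypothetical model outputs, written by hand for illustration.
good = {"criterion": "Data Readiness", "score": 4, "confidence": 0.7,
        "reasoning": "Invoices exist but are unlabeled."}
out_of_range = {**good, "score": 15, "confidence": 2.0}
percent_string = {**good, "confidence": "70%"}
missing_reasoning = {k: v for k, v in good.items() if k != "reasoning"}

for payload in (good, out_of_range, percent_string, missing_reasoning):
    print(failed_fields(payload))
```

Each of the failure modes named above (out-of-range values, a percentage string, an omitted field) surfaces as a ValidationError tied to a specific field, which is the signal a retry loop needs.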
Code Walkthrough
Now that you understand the four weighted scoring dimensions, let's implement the engine that applies them systematically to each use case a workshop surfaces.
The pipeline has two parts: a set of Pydantic models that define the evaluation schema, and an Instructor-patched OpenAI client that forces the LLM to return a validated instance of that schema rather than freeform text.
CriterionScore captures a single dimension's output — a bounded score (0–10), a confidence value (0–1), and a free-text reasoning field. UseCaseEvaluation aggregates these criterion scores into a weighted_total, a recommendation, and a list of risks. ScoringCriteria holds the configurable weights; the defaults (0.35 / 0.25 / 0.25 / 0.15) favor business impact while keeping technical feasibility and data readiness equally weighted, with time-to-value as the lightest factor.
```python
from typing import List

from pydantic import BaseModel, Field


class CriterionScore(BaseModel):
    """One evaluation dimension's output."""
    criterion: str
    score: float = Field(ge=0, le=10)      # bounded 0-10
    confidence: float = Field(ge=0, le=1)  # bounded 0-1
    reasoning: str


class UseCaseEvaluation(BaseModel):
    """The full structured verdict for a single use case."""
    use_case_name: str
    scores: List[CriterionScore]
    weighted_total: float
    recommendation: str
    risks: List[str]


class ScoringCriteria(BaseModel):
    """Configurable dimension weights; the defaults sum to 1.0."""
    business_impact_weight: float = 0.35
    technical_feasibility_weight: float = 0.25
    data_readiness_weight: float = 0.25
    time_to_value_weight: float = 0.15
```
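The ScoringCriteria model as written accepts any four floats, so a mistyped weight would quietly change the scale of every total. One possible guard, sketched here under the assumption that weights should always sum to 1.0, is a Pydantic model validator. CheckedScoringCriteria is a hypothetical variant, not part of the lesson's code.

```python
import math

from pydantic import BaseModel, ValidationError, model_validator


class CheckedScoringCriteria(BaseModel):
    """Hypothetical variant of ScoringCriteria that rejects weights not summing to 1."""
    business_impact_weight: float = 0.35
    technical_feasibility_weight: float = 0.25
    data_readiness_weight: float = 0.25
    time_to_value_weight: float = 0.15

    @model_validator(mode="after")
    def weights_sum_to_one(self) -> "CheckedScoringCriteria":
        total = (
            self.business_impact_weight
            + self.technical_feasibility_weight
            + self.data_readiness_weight
            + self.time_to_value_weight
        )
        if not math.isclose(total, 1.0, abs_tol=1e-9):
            raise ValueError(f"weights must sum to 1.0, got {total:.2f}")
        return self


defaults = CheckedScoringCriteria()  # the lesson's defaults pass the check

try:
    # Raising one weight without lowering another pushes the total to 1.15.
    CheckedScoringCriteria(business_impact_weight=0.5)
    rejected = False
except ValidationError:
    rejected = True

print(rejected)
```

The check runs when an instance is constructed. Pydantic's model_copy(update=...) does not re-run validation, so building a new instance is the safer way to change weights when a guard like this is in place.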
With the schema in place, instructor.from_openai() patches the standard OpenAI client so that chat.completions.create() validates the model's response against response_model=UseCaseEvaluation before returning. If validation fails, Instructor can re-ask the model with the validation errors attached, up to the limit set by the max_retries argument to create(). The result is a typed UseCaseEvaluation object you can immediately sort alongside other candidates, with no manual text parsing required.
```python
import instructor
import openai

# UseCaseEvaluation is the Pydantic model defined in the previous snippet.

proxy_url = "http://openai-proxy:8080"
scoring_prompt = (
    "Evaluate this AI use case across Business Impact, Technical Feasibility, "
    "Data Readiness, and Time-to-Value. Score each dimension 0–10 with reasoning."
)
use_case_description = (
    "Automate invoice extraction from scanned PDFs using a vision model."
)

# Patch the standard client so create() accepts a response_model.
client = instructor.from_openai(
    openai.OpenAI(api_key="student-token", base_url=proxy_url)
)

# The call returns a validated UseCaseEvaluation instance, not raw text.
evaluation = client.chat.completions.create(
    model="gpt-4o",
    response_model=UseCaseEvaluation,
    messages=[
        {"role": "system", "content": scoring_prompt},
        {"role": "user", "content": use_case_description},
    ],
)

print(evaluation.use_case_name, evaluation.weighted_total, evaluation.recommendation)
for s in evaluation.scores:
    print(f" {s.criterion}: {s.score}/10 (confidence {s.confidence:.2f}) — {s.reasoning}")
```
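Verify by running the script and confirming that evaluation.scores contains four CriterionScore entries, one per weighted dimension, and that evaluation.weighted_total is a float between 0 and 10. The schema as written enforces neither property (scores has no length constraint and weighted_total has no bounds), so check both explicitly rather than assuming them.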
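Those checks can be automated. The sketch below confirms that all four dimensions are present, recomputes the total client-side from the default weights, and ranks candidates by the recomputed value. The recompute_total, rank, and make_evaluation helpers are hypothetical names, the evaluations are hand-built stand-ins for LLM responses, and the WEIGHTS keys assume the model labels each criterion exactly as the system prompt spells it.

```python
from pydantic import BaseModel, Field


class CriterionScore(BaseModel):
    criterion: str
    score: float = Field(ge=0, le=10)
    confidence: float = Field(ge=0, le=1)
    reasoning: str


class UseCaseEvaluation(BaseModel):
    use_case_name: str
    scores: list[CriterionScore]
    weighted_total: float
    recommendation: str
    risks: list[str]


# Assumption: the model uses exactly these criterion labels.
WEIGHTS = {
    "Business Impact": 0.35,
    "Technical Feasibility": 0.25,
    "Data Readiness": 0.25,
    "Time-to-Value": 0.15,
}


def recompute_total(evaluation: UseCaseEvaluation) -> float:
    """Check that all four dimensions are present, then apply the weights."""
    by_name = {s.criterion: s.score for s in evaluation.scores}
    if set(by_name) != set(WEIGHTS) or len(evaluation.scores) != len(WEIGHTS):
        raise ValueError(f"expected {sorted(WEIGHTS)}, got {sorted(by_name)}")
    return round(sum(WEIGHTS[name] * score for name, score in by_name.items()), 2)


def rank(evaluations: list[UseCaseEvaluation]) -> list[UseCaseEvaluation]:
    """Order evaluations by the recomputed total, highest first."""
    return sorted(evaluations, key=recompute_total, reverse=True)


def make_evaluation(name: str, reported_total: float, scores: list[float]) -> UseCaseEvaluation:
    """Build a hand-written evaluation (a stand-in for an LLM response)."""
    return UseCaseEvaluation(
        use_case_name=name,
        scores=[
            CriterionScore(criterion=c, score=s, confidence=0.8, reasoning="illustrative")
            for c, s in zip(WEIGHTS, scores)
        ],
        weighted_total=reported_total,
        recommendation="illustrative",
        risks=[],
    )


# The first stand-in reports the raw sum (25.0) instead of a weighted total.
invoice = make_evaluation("Invoice extraction", 25.0, [8, 7, 4, 6])
chatbot = make_evaluation("Support chatbot", 7.0, [7, 8, 7, 6])

ranked_names = [e.use_case_name for e in rank([invoice, chatbot])]
print(ranked_names, recompute_total(invoice), recompute_total(chatbot))
```

Sorting on the reported weighted_total would have put the invoice candidate first on the strength of an unweighted sum; recomputing from the per-dimension scores keeps the ranking tied to the configured weights.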
Do's and Don'ts
The following Do's and Don'ts distill the material above into practice.
Do's
- ✓ Do define CriterionScore and UseCaseEvaluation as Pydantic models with bounded Field constraints: the ge=0, le=10 bound on score and ge=0, le=1 on confidence enforce semantic validity at parse time, so a hallucinated score of 15 or a confidence of 2.0 raises a validation error before the bad value reaches your ranking logic.
- ✓ Do patch the OpenAI client with instructor.from_openai() and pass response_model=UseCaseEvaluation: this forces the LLM to return a structured, typed instance rather than freeform text, eliminating manual JSON parsing and giving Instructor the retry hook it needs to recover from malformed model output automatically.
- ✓ Do tune the ScoringCriteria weights deliberately before running a workshop: the defaults (0.35 / 0.25 / 0.25 / 0.15) embed an explicit prior that business impact outweighs time-to-value; changing them without a client rationale silently shifts the ranked shortlist and undermines the repeatability the engine is designed to provide.
Don'ts
- ✗ Don't omit the response_model argument and parse the LLM's text response manually: doing so bypasses Instructor's validation-and-retry loop, so a single malformed completion can leave you with a hand-built evaluation that has missing scores entries or a weighted_total of None, corrupting the ranked output without raising an error.
- ✗ Don't let weighted_total be computed client-side by summing raw CriterionScore.score values without applying the ScoringCriteria weights: ignoring the 0.35 / 0.25 / 0.25 / 0.15 multipliers collapses all four dimensions to equal importance, defeating the engine's ability to surface use cases with outsized business impact over technically easy but low-value candidates.
- ✗ Don't reuse a single UseCaseEvaluation instance across multiple use case descriptions by mutating its fields: each call to client.chat.completions.create() must receive a fresh use_case_description and return a new UseCaseEvaluation object; reusing or copying evaluation objects across candidates corrupts the risks list and recommendation with data from a prior use case, producing a misleading shortlist.
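The sample system prompt in the walkthrough never states the weights, so the model has no way to apply them when it fills in weighted_total. One way to close that gap, sketched here as an assumption rather than the lesson's own approach, is to render the weights into the prompt. build_scoring_prompt is a hypothetical helper, and the prompt wording is illustrative.

```python
def build_scoring_prompt(weights: dict[str, float]) -> str:
    """Render the dimension weights into a system prompt the model can follow."""
    lines = [
        "Evaluate this AI use case. Score each dimension 0-10 with reasoning,",
        "then set weighted_total to the sum of score x weight using these weights:",
    ]
    for field, weight in weights.items():
        # "business_impact_weight" -> "Business Impact"
        label = field.removesuffix("_weight").replace("_", " ").title()
        lines.append(f"- {label}: {weight:.2f}")
    return "\n".join(lines)


# Literal copy of the lesson's default weights; ScoringCriteria().model_dump()
# would yield the same mapping.
default_weights = {
    "business_impact_weight": 0.35,
    "technical_feasibility_weight": 0.25,
    "data_readiness_weight": 0.25,
    "time_to_value_weight": 0.15,
}

scoring_prompt = build_scoring_prompt(default_weights)
print(scoring_prompt)
```

Even with the weights in the prompt, the model's arithmetic is not guaranteed, so a client-side recomputation from the per-dimension scores remains a sensible cross-check.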