Score AI use cases with weighted multi-criteria evaluation

You build a UseCaseScoringEngine that uses Instructor and Pydantic to extract typed scores for business impact, technical feasibility, data readiness, and time-to-value, then ranks use cases via configurable weights.

Introduction

When you're running a client discovery workshop, comparing twenty AI use cases by intuition alone leads to missed opportunities and misprioritized investments. Organizations need a repeatable, objective method to score each candidate across dimensions that actually predict production success — not just executive enthusiasm or engineering novelty. By the end of this lesson, you'll be able to build a weighted scoring engine that evaluates each use case across business impact, technical feasibility, data readiness, and time-to-value, producing a ranked shortlist backed by structured LLM outputs your clients can act on immediately.

Key Terminology

  • Weighted Scoring Engine — a system that evaluates each AI use case across named dimensions, multiplies each dimension's score by a configured weight, and sums the products into a single weighted_total float that enables objective, comparable ranking across all candidates surfaced in a discovery workshop.
  • CriterionScore — a Pydantic model representing one evaluation dimension's output: a bounded numeric score (0–10), a confidence value (0–1), and a free-text reasoning string explaining the rating; four CriterionScore entries — one per dimension — are collected inside each UseCaseEvaluation.
  • UseCaseEvaluation — the top-level Pydantic schema the scoring engine returns for a single candidate, containing the list of CriterionScore entries, a weighted_total float, a recommendation string, and a risks list — the complete structured verdict ready for immediate ranking.
  • ScoringCriteria — a Pydantic model holding the four configurable dimension weights (business_impact_weight, technical_feasibility_weight, data_readiness_weight, time_to_value_weight); the defaults (0.35 / 0.25 / 0.25 / 0.15) favor business impact while treating technical feasibility and data readiness symmetrically.
  • Instructor — a Python library that patches a standard OpenAI client via instructor.from_openai(), adding schema enforcement to chat.completions.create() so the LLM's response is validated against a response_model Pydantic class and retried automatically on malformed output — returning a typed Python object instead of raw text.

Concepts

Scoring as an Engineering Discipline

Discovery workshops routinely surface fifteen to thirty AI use case candidates in a single session. Evaluating them by intuition — "this one seems impactful," "that one sounds hard" — produces rankings that reflect whoever speaks loudest rather than what actually predicts production success. A weighted scoring engine converts that subjective exercise into a repeatable calculation: each use case is independently evaluated across a fixed set of dimensions, each dimension yields a bounded score and a confidence value, and a single weighted sum determines the final ranking.

The critical design insight is that the dimensions and their weights encode your beliefs about what makes an AI project succeed in production. Because those weights live in ScoringCriteria as plain Pydantic fields, calibrating them for a specific client — say, downweighting time-to-value for a company with a long investment horizon — requires editing one object, not rewriting evaluation logic.
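As a minimal sketch of that one-object calibration, the block below overrides two weights for a hypothetical long-horizon client. The sum-to-1.0 check is an assumption added here (the lesson's defaults happen to sum to 1.0, which keeps totals on the 0 to 10 scale), and it relies on Pydantic v2's model_validator.

```python
from pydantic import BaseModel, model_validator


class ScoringCriteria(BaseModel):
    business_impact_weight: float = 0.35
    technical_feasibility_weight: float = 0.25
    data_readiness_weight: float = 0.25
    time_to_value_weight: float = 0.15

    @model_validator(mode="after")
    def weights_sum_to_one(self) -> "ScoringCriteria":
        # Assumption (not stated in the lesson): weights should sum to 1.0
        # so that a weighted total stays on the same 0-10 scale as the scores.
        total = (
            self.business_impact_weight
            + self.technical_feasibility_weight
            + self.data_readiness_weight
            + self.time_to_value_weight
        )
        if abs(total - 1.0) > 1e-9:
            raise ValueError(f"weights must sum to 1.0, got {total:.2f}")
        return self


# Hypothetical client with a long investment horizon: move weight from
# time-to-value to business impact and leave the other two defaults alone.
long_horizon = ScoringCriteria(business_impact_weight=0.45, time_to_value_weight=0.05)
print(long_horizon.model_dump())
```

The evaluation logic never changes; only the ScoringCriteria instance passed to it does, and a weight set that no longer sums to 1.0 fails at construction time rather than silently rescaling the ranking.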

The Four Dimensions and Why the Defaults Are Set as They Are

The engine evaluates every candidate across four distinct dimensions. Business impact (weight 0.35) captures delivered value (revenue, cost reduction, competitive advantage) and carries the highest weight because technical elegance without business value never reaches production. Technical feasibility (0.25) checks whether current model capabilities can meet the required accuracy and latency thresholds for the task type. Data readiness (0.25) evaluates whether the necessary data assets are actually available in terms of volume, quality, accessibility, labels, and regulatory clearance; a high-impact idea built on nonexistent labeled data scores near zero here. Time-to-value (0.15) estimates how quickly the project produces measurable results, acting as a tiebreaker that nudges the shortlist toward early wins that build organizational confidence.

The default weights treat feasibility and data readiness symmetrically because a common failure mode is to prioritize technically easy problems over data-ready ones, a habit that systematically underestimates data pipeline costs and preparation timelines. Keeping them equal forces the scoring engine, and the workshop participants reading its output, to confront both dimensions together.
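The effect of that symmetry can be sketched with two made-up candidates whose feasibility and data-readiness scores are mirrored. The skewed weight set and all scores below are illustrative assumptions, not values from the lesson.

```python
DEFAULT_WEIGHTS = {
    "business_impact": 0.35,
    "technical_feasibility": 0.25,
    "data_readiness": 0.25,
    "time_to_value": 0.15,
}

# Hypothetical skew that favors technical feasibility over data readiness.
SKEWED_WEIGHTS = {**DEFAULT_WEIGHTS, "technical_feasibility": 0.40, "data_readiness": 0.10}


def weighted_total(scores: dict, weights: dict) -> float:
    """Multiply each dimension's score by its weight and sum the products."""
    return sum(scores[name] * weight for name, weight in weights.items())


# Two illustrative candidates with mirrored feasibility / data-readiness scores.
easy_but_data_poor = {
    "business_impact": 7, "technical_feasibility": 9, "data_readiness": 3, "time_to_value": 6,
}
data_ready_but_hard = {
    "business_impact": 7, "technical_feasibility": 3, "data_readiness": 9, "time_to_value": 6,
}

for label, weights in [("default", DEFAULT_WEIGHTS), ("skewed", SKEWED_WEIGHTS)]:
    a = weighted_total(easy_but_data_poor, weights)
    b = weighted_total(data_ready_but_hard, weights)
    print(f"{label}: easy-but-data-poor={a:.2f}, data-ready-but-hard={b:.2f}")
```

Under the default weights both candidates land at 6.35, so neither dimension can hide the other's weakness. Under the skewed weights the technically easy candidate jumps to 7.25 while the data-ready one drops to 5.45, which is the bias the symmetric defaults are meant to avoid.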

Pydantic + Instructor as a Validation Gate

Without schema enforcement, an LLM-backed scoring engine is unreliable: the model might express a score as a sentence, omit a dimension entirely, or format confidence as a percentage string instead of a 0–1 float. The solution is to define the expected output as a Pydantic schema — CriterionScore capturing one dimension, UseCaseEvaluation aggregating the full verdict — and use Instructor to make the OpenAI client enforce it.
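To make those failure modes concrete, here is a small offline sketch that feeds hand-written malformed payloads (hypothetical examples, not real model output) to the lesson's CriterionScore schema and shows Pydantic rejecting each one.

```python
from pydantic import BaseModel, Field, ValidationError


class CriterionScore(BaseModel):
    criterion: str
    score: float = Field(ge=0, le=10)
    confidence: float = Field(ge=0, le=1)
    reasoning: str


# Hypothetical payloads of the kind an unconstrained model might emit.
bad_payloads = [
    # Score expressed as a sentence instead of a number.
    {"criterion": "Data Readiness", "score": "fairly high", "confidence": 0.8, "reasoning": "Labels exist."},
    # Confidence formatted as a percentage string.
    {"criterion": "Data Readiness", "score": 6, "confidence": "85%", "reasoning": "Labels exist."},
    # Score outside the 0-10 bound.
    {"criterion": "Data Readiness", "score": 15, "confidence": 0.8, "reasoning": "Labels exist."},
]

rejected = 0
for payload in bad_payloads:
    try:
        CriterionScore(**payload)
    except ValidationError as exc:
        rejected += 1
        first_error = exc.errors()[0]
        print(f"rejected field {first_error['loc']}: {first_error['msg']}")

print(f"{rejected} of {len(bad_payloads)} payloads rejected")
```

Each bad value is stopped at parse time, before it could reach any ranking logic. Note that this schema alone does not catch an omitted dimension, because a list of CriterionScore entries can be any length.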

instructor.from_openai() patches the standard client so that passing response_model=UseCaseEvaluation to chat.completions.create() acts as a post-generation validator: if the model's JSON does not parse into a valid UseCaseEvaluation instance, Instructor re-asks the model, up to the limit set by the max_retries argument, before surfacing an error. The payoff is a fully typed Python object you can immediately sort alongside other candidates and serialize to a client report, with no regex, no string parsing, and no defensive null-checks on fields that freeform text might or might not include (see Code Walkthrough).
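The sorting and serialization payoff can be sketched offline. The evaluations below are built by hand as stand-ins for what the patched client would return; the stand_in helper, the names, and the numbers are all illustrative assumptions, and model_dump_json assumes Pydantic v2.

```python
from typing import List

from pydantic import BaseModel, Field


class CriterionScore(BaseModel):
    criterion: str
    score: float = Field(ge=0, le=10)
    confidence: float = Field(ge=0, le=1)
    reasoning: str


class UseCaseEvaluation(BaseModel):
    use_case_name: str
    scores: List[CriterionScore]
    weighted_total: float
    recommendation: str
    risks: List[str]


def stand_in(name: str, total: float, risk: str) -> UseCaseEvaluation:
    # Hypothetical helper: builds by hand what the patched client would return.
    return UseCaseEvaluation(
        use_case_name=name,
        scores=[],  # per-dimension entries omitted to keep the sketch short
        weighted_total=total,
        recommendation="Illustrative only",
        risks=[risk],
    )


evaluations = [
    stand_in("Contract clause search", 5.4, "Legal review of source documents pending"),
    stand_in("Invoice extraction", 7.1, "Scan quality varies by supplier"),
    stand_in("Churn prediction", 6.3, "Labels cover a limited period"),
]

# Typed attribute access makes ranking a plain sort, with no text parsing.
ranked = sorted(evaluations, key=lambda e: e.weighted_total, reverse=True)
for position, evaluation in enumerate(ranked, start=1):
    print(f"{position}. {evaluation.use_case_name}: {evaluation.weighted_total:.1f}")

# Pydantic v2 serializes the whole verdict for a report in one call.
report_json = ranked[0].model_dump_json(indent=2)
print(report_json)
```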

Code Walkthrough

Now that you understand the four weighted scoring dimensions, let's implement the engine that applies them systematically to each use case a workshop surfaces.

The pipeline has two parts: a set of Pydantic models that define the evaluation schema, and an Instructor-patched OpenAI client that forces the LLM to return a validated instance of that schema rather than freeform text.

CriterionScore captures a single dimension's output — a bounded score (0–10), a confidence value (0–1), and a free-text reasoning field. UseCaseEvaluation aggregates these criterion scores into a weighted_total, a recommendation, and a list of risks. ScoringCriteria holds the configurable weights; the defaults (0.35 / 0.25 / 0.25 / 0.15) favor business impact while keeping technical feasibility and data readiness equally weighted, with time-to-value as the lightest factor.

```python
from typing import List

from pydantic import BaseModel, Field


# One evaluation dimension's output.
class CriterionScore(BaseModel):
    criterion: str
    score: float = Field(ge=0, le=10)      # bounded to 0-10
    confidence: float = Field(ge=0, le=1)  # bounded to 0-1
    reasoning: str


# Complete structured verdict for a single use case.
class UseCaseEvaluation(BaseModel):
    use_case_name: str
    scores: List[CriterionScore]
    weighted_total: float
    recommendation: str
    risks: List[str]


# Configurable dimension weights; the defaults sum to 1.0.
class ScoringCriteria(BaseModel):
    business_impact_weight: float = 0.35
    technical_feasibility_weight: float = 0.25
    data_readiness_weight: float = 0.25
    time_to_value_weight: float = 0.15
```

With the schema in place, instructor.from_openai() patches the standard OpenAI client so that chat.completions.create() validates the model's response against response_model=UseCaseEvaluation before returning. If the model produces malformed output, Instructor retries automatically. The result is a typed UseCaseEvaluation object you can immediately sort alongside other candidates — no manual text parsing required.

```python
import instructor
import openai

# UseCaseEvaluation is the schema defined in the previous snippet.

proxy_url = "http://openai-proxy:8080"
scoring_prompt = (
    "Evaluate this AI use case across Business Impact, Technical Feasibility, "
    "Data Readiness, and Time-to-Value. Score each dimension 0–10 with reasoning."
)
use_case_description = (
    "Automate invoice extraction from scanned PDFs using a vision model."
)

# Patch the OpenAI client so responses are validated against response_model.
client = instructor.from_openai(
    openai.OpenAI(api_key="student-token", base_url=proxy_url)
)

# Returns a validated UseCaseEvaluation instance rather than raw text.
evaluation = client.chat.completions.create(
    model="gpt-4o",
    response_model=UseCaseEvaluation,
    messages=[
        {"role": "system", "content": scoring_prompt},
        {"role": "user", "content": use_case_description},
    ],
)

print(evaluation.use_case_name, evaluation.weighted_total, evaluation.recommendation)
for s in evaluation.scores:
    print(f"  {s.criterion}: {s.score}/10 (confidence {s.confidence:.2f}) — {s.reasoning}")
```

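Verify by running the script and confirming that evaluation.scores contains four CriterionScore entries, one per weighted dimension, and that evaluation.weighted_total is a float between 0 and 10. The schema as written enforces neither condition (scores is an unbounded list and weighted_total carries no Field bounds), so check both explicitly. Because scoring_prompt does not state the ScoringCriteria weights, treat the returned weighted_total as model-reported until it has been checked against those weights.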
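One way to run that verification is to recompute the total deterministically from the per-dimension scores. The sketch below is an assumption-laden illustration rather than the lesson's engine: the normalize helper and its naming convention are hypothetical, the evaluation is hand-built instead of coming from the model, and model_copy assumes Pydantic v2.

```python
from typing import List

from pydantic import BaseModel, Field


class CriterionScore(BaseModel):
    criterion: str
    score: float = Field(ge=0, le=10)
    confidence: float = Field(ge=0, le=1)
    reasoning: str


class UseCaseEvaluation(BaseModel):
    use_case_name: str
    scores: List[CriterionScore]
    weighted_total: float
    recommendation: str
    risks: List[str]


class ScoringCriteria(BaseModel):
    business_impact_weight: float = 0.35
    technical_feasibility_weight: float = 0.25
    data_readiness_weight: float = 0.25
    time_to_value_weight: float = 0.15


def normalize(name: str) -> str:
    # Hypothetical convention: "Time-to-Value" -> "time_to_value".
    return name.strip().lower().replace("-", "_").replace(" ", "_")


def recompute_weighted_total(evaluation: UseCaseEvaluation, criteria: ScoringCriteria) -> float:
    """Apply the configured weights to the per-dimension scores and sum them."""
    weights = {
        "business_impact": criteria.business_impact_weight,
        "technical_feasibility": criteria.technical_feasibility_weight,
        "data_readiness": criteria.data_readiness_weight,
        "time_to_value": criteria.time_to_value_weight,
    }
    scores = {normalize(s.criterion): s.score for s in evaluation.scores}
    missing = sorted(set(weights) - set(scores))
    if missing:
        raise ValueError(f"evaluation is missing dimensions: {missing}")
    return sum(scores[name] * weight for name, weight in weights.items())


# Hand-built stand-in for a model response; every value is illustrative.
evaluation = UseCaseEvaluation(
    use_case_name="Invoice extraction",
    scores=[
        CriterionScore(criterion="Business Impact", score=8, confidence=0.8, reasoning="Cuts manual keying."),
        CriterionScore(criterion="Technical Feasibility", score=7, confidence=0.7, reasoning="Vision models read scans."),
        CriterionScore(criterion="Data Readiness", score=5, confidence=0.6, reasoning="Few labeled invoices."),
        CriterionScore(criterion="Time-to-Value", score=9, confidence=0.7, reasoning="Small pilot scope."),
    ],
    weighted_total=7.5,  # as reported by the model in this made-up example
    recommendation="Pilot",
    risks=["Scan quality varies"],
)

checked_total = recompute_weighted_total(evaluation, ScoringCriteria())
assert len(evaluation.scores) == 4
assert 0 <= checked_total <= 10

# Replace the model-reported value with the deterministic one.
evaluation = evaluation.model_copy(update={"weighted_total": checked_total})
print(f"{evaluation.weighted_total:.2f}")
```

In this made-up case the recomputed total is 7.15 against a reported 7.5, the kind of drift that would otherwise reorder a shortlist unnoticed.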

Do's and Don'ts

The following Do's and Don'ts distill the material above into practice.

Do's

  1. Do define CriterionScore and UseCaseEvaluation as Pydantic models with bounded Field constraints — the ge=0, le=10 bound on score and ge=0, le=1 on confidence enforce semantic validity at parse time, so a hallucinated score of 15 or a confidence of 2.0 raises a validation error before the bad value reaches your ranking logic.
  2. Do patch the OpenAI client with instructor.from_openai() and pass response_model=UseCaseEvaluation — this forces the LLM to return a structured, typed instance rather than freeform text, eliminating manual JSON parsing and giving Instructor the retry hook it needs to recover from malformed model output automatically.
  3. Do tune the ScoringCriteria weights deliberately before running a workshop — the defaults (0.35 / 0.25 / 0.25 / 0.15) embed an explicit prior that business impact outweighs time-to-value; changing them without a client rationale silently shifts the ranked shortlist and undermines the repeatability the engine is designed to provide.

Don'ts

  1. Don't omit the response_model argument and parse the LLM's text response manually: doing so bypasses Instructor's validation-and-retry loop, so a single malformed completion can slip through hand-written parsing as a record with missing scores entries or a weighted_total of None, corrupting the ranked output without raising an error.
  2. Don't let weighted_total be computed client-side by summing raw CriterionScore.score values without applying the ScoringCriteria weights: ignoring the 0.35 / 0.25 / 0.25 / 0.15 multipliers collapses all four dimensions to equal importance, defeating the engine's ability to surface use cases with outsized business impact over technically easy but low-value candidates.
  3. Don't reuse a single UseCaseEvaluation instance across multiple use case descriptions by mutating its fields: each call to client.chat.completions.create() must receive its own use_case_description and return a new UseCaseEvaluation object. Carrying one object over from candidate to candidate leaves risks and recommendation values from a prior use case in place, producing a misleading shortlist.
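The third Don't can be sketched as a loop that produces one fresh object per candidate. To stay runnable offline, the sketch injects the evaluator as a function: fake_evaluate is a hypothetical stand-in for a wrapper around the patched client's chat.completions.create() call, and the schema is trimmed to the fields this sketch needs.

```python
from typing import Callable, List

from pydantic import BaseModel


class UseCaseEvaluation(BaseModel):
    # Trimmed to the fields this sketch needs; the lesson's schema also has scores.
    use_case_name: str
    weighted_total: float
    recommendation: str
    risks: List[str]


def evaluate_all(
    descriptions: List[str],
    evaluate_one: Callable[[str], UseCaseEvaluation],
) -> List[UseCaseEvaluation]:
    """Call the evaluator once per description so each candidate gets its own object."""
    return [evaluate_one(description) for description in descriptions]


def fake_evaluate(description: str) -> UseCaseEvaluation:
    # Hypothetical stand-in for the LLM call; the values are placeholders.
    return UseCaseEvaluation(
        use_case_name=description,
        weighted_total=5.0,
        recommendation="Review",
        risks=[f"Unverified assumptions in: {description}"],
    )


results = evaluate_all(
    ["Invoice extraction from scanned PDFs", "Support ticket triage"],
    fake_evaluate,
)
for result in results:
    print(result.use_case_name, result.risks)
```

Because every result is constructed independently, no risks list or recommendation can leak from one candidate into the next.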
