Build the API and interface layer for Team Structure for AI
Create the API endpoints, CLI commands, or UI components that expose the functionality built for designing team topologies, including platform team, agent team, ML team, and hybrid structures.
ML Team
Introduction
Teams that scale GenAI without a dedicated ML team end up with every product squad picking models in isolation — duplicated evals, inconsistent quality, and serving cost no one can predict. This lesson covers how to structure an ML/model team that sits between research and production engineering. You'll learn how to balance research and production orientations, define a model lifecycle with explicit handoff gates, and centralize evaluation and experiment tracking so application teams can consume models as reliable building blocks.
Key Terminology
- ModelArtifact — A versioned model record carrying a lifecycle status (experimental, staging, production, or deprecated), an owner, and evaluation scores; the unit the registry tracks as a model moves through the lifecycle.
- MLTeamMember — A role plus orientation field (research or production); surfacing orientation makes the team's research-to-production balance auditable instead of accidental.
- HandoffProtocol — The explicit gate criteria and named responsible role for promoting a model between lifecycle stages; replaces informal "seems good enough" promotion with verifiable checks.
- Model card — A standardized document for each registered model covering architecture, training data, evaluation scores, recommended use cases, and known failure modes, so application teams adopt a model with eyes open.
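To make the model card concrete, here is a minimal sketch of one as a Python dataclass. The fields mirror the sections named above. The class name `ModelCard`, the `missing_sections` helper, and the example values are hypothetical and purely illustrative, not a prescribed schema.

```python
from dataclasses import dataclass, field


@dataclass
class ModelCard:
    """Hypothetical model card schema covering the sections a consuming team needs."""
    model_name: str
    architecture: str
    training_data: str
    eval_scores: dict[str, float] = field(default_factory=dict)
    recommended_use_cases: list[str] = field(default_factory=list)
    known_failure_modes: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Return the names of empty sections so incomplete cards can be rejected."""
        sections = {
            "architecture": self.architecture,
            "training_data": self.training_data,
            "eval_scores": self.eval_scores,
            "recommended_use_cases": self.recommended_use_cases,
            "known_failure_modes": self.known_failure_modes,
        }
        return [name for name, value in sections.items() if not value]


# Illustrative card: the failure-modes section has not been filled in yet.
card = ModelCard(
    model_name="support-summarizer",
    architecture="decoder-only transformer, fine-tuned",
    training_data="anonymized support tickets",
    eval_scores={"faithfulness": 0.91},
    recommended_use_cases=["ticket summarization"],
)
print(card.missing_sections())  # ['known_failure_modes']
```

Using a check like `missing_sections` as a registration precondition is one way to make sure no model reaches application teams without its known failure modes written down.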
Concepts
The ML team's core job is to own the model lifecycle on behalf of the rest of the organization: selecting models, fine-tuning when justified, evaluating them against shared benchmarks, and promoting them through experimental → staging → production stages via explicit handoff gates. Internally, the team must balance research and production orientations so exploration does not crowd out stable artifacts and vice versa. Externally, it provides three shared services — model registry, evaluation framework, and experiment tracking — that turn individual researcher knowledge into organizational knowledge and prevent application teams from independently re-deriving model decisions.
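A centralized evaluation framework can be pictured as a single benchmark suite that every candidate model runs through, which makes scores comparable across use cases. This is a minimal sketch. It assumes a model can be called as a plain `str -> str` function and that each benchmark is a list of (prompt, expected) pairs scored by exact match. `SHARED_BENCHMARKS`, `run_shared_eval`, and the benchmark contents are hypothetical, and a real suite would use richer metrics.

```python
from typing import Callable

# Assumption: a model is any callable that maps a prompt to a completion.
Model = Callable[[str], str]

# Hypothetical shared suite owned by the ML team: benchmark name -> (prompt, expected) cases.
SHARED_BENCHMARKS: dict[str, list[tuple[str, str]]] = {
    "arithmetic": [("2+2", "4"), ("3*3", "9")],
    "capitals": [("France", "Paris"), ("Japan", "Tokyo")],
}


def run_shared_eval(model: Model) -> dict[str, float]:
    """Score a model on every shared benchmark using exact-match accuracy."""
    scores: dict[str, float] = {}
    for name, cases in SHARED_BENCHMARKS.items():
        correct = sum(1 for prompt, expected in cases if model(prompt).strip() == expected)
        scores[name] = correct / len(cases)
    return scores


# A toy lookup-table "model" stands in for a real serving endpoint.
toy_answers = {"2+2": "4", "3*3": "6", "France": "Paris", "Japan": "Tokyo"}
print(run_shared_eval(lambda p: toy_answers.get(p, "")))
# {'arithmetic': 0.5, 'capitals': 1.0}
```

Because every application team calls the same `run_shared_eval`, a score of 0.5 on `arithmetic` means the same thing no matter which team asks.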
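The model lifecycle the ML team owns moves each artifact from experimental to staging to production, and eventually to deprecated. The model registry, evaluation framework, and experiment tracking support every one of those transitions.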
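You can encode the lifecycle as an explicit transition table so that only sanctioned moves are possible. This sketch assumes forward promotion one stage at a time, plus deprecation from any active stage. Allowing deprecation from every stage, and treating deprecated as terminal, are assumptions the lesson does not specify. `Stage`, `ALLOWED_TRANSITIONS`, and `can_transition` are hypothetical names.

```python
from enum import Enum


class Stage(str, Enum):
    EXPERIMENTAL = "experimental"
    STAGING = "staging"
    PRODUCTION = "production"
    DEPRECATED = "deprecated"


# Assumed transition table: promote one stage at a time, deprecate from any
# active stage, and never leave the deprecated stage.
ALLOWED_TRANSITIONS: dict[Stage, set[Stage]] = {
    Stage.EXPERIMENTAL: {Stage.STAGING, Stage.DEPRECATED},
    Stage.STAGING: {Stage.PRODUCTION, Stage.DEPRECATED},
    Stage.PRODUCTION: {Stage.DEPRECATED},
    Stage.DEPRECATED: set(),
}


def can_transition(current: str, target: str) -> bool:
    """Return True if moving from `current` to `target` is a sanctioned lifecycle step."""
    return Stage(target) in ALLOWED_TRANSITIONS[Stage(current)]


print(can_transition("experimental", "staging"))     # True
print(can_transition("experimental", "production"))  # False: staging cannot be skipped
```

Each allowed edge in this table is a place where a `HandoffProtocol` should exist. A missing protocol for an allowed edge is itself a gap worth flagging.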
Code Walkthrough
Building on the previous section, here is how ModelArtifact, MLTeamMember, and HandoffProtocol fit together as a minimal model-lifecycle harness — three dataclasses that make the team's structure and the lifecycle gates auditable in code.
```python
from dataclasses import dataclass, field


@dataclass
class ModelArtifact:
    """A model record tracked by the registry as it moves through the lifecycle."""
    name: str
    status: str  # experimental | staging | production | deprecated
    owner: str
    eval_scores: dict[str, float] = field(default_factory=dict)  # benchmark name -> score


@dataclass
class MLTeamMember:
    """A role plus orientation, so the research/production balance is auditable."""
    role: str
    orientation: str  # research | production


@dataclass
class HandoffProtocol:
    """Gate criteria and the accountable role for one lifecycle transition."""
    from_stage: str
    to_stage: str
    gate_criteria: list[str]
    responsible_role: str


# The staging -> production gate: every criterion must be verified
# by the ML Production Engineer before a model is promoted.
staging_to_prod = HandoffProtocol(
    from_stage="staging",
    to_stage="production",
    gate_criteria=[
        "p99 latency within SLO",
        "safety evaluation passes all guardrails",
        "rollback procedure documented and tested",
    ],
    responsible_role="ML Production Engineer",
)
```
You'll know it works when every ModelArtifact in your registry has an owner, every team member's orientation is recorded, and a model can advance a stage only by satisfying every item in the gate_criteria list of the matching HandoffProtocol.
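The dataclasses describe the gates, but nothing in them yet stops a model from skipping one. Below is a minimal sketch of enforcement. It assumes the responsible role records the criteria it has verified as a set of strings that must exactly match `gate_criteria`. `promote` and `audit_registry` are hypothetical helpers, and the two dataclasses are repeated so the snippet runs on its own.

```python
from dataclasses import dataclass, field


@dataclass
class ModelArtifact:
    name: str
    status: str  # experimental | staging | production | deprecated
    owner: str
    eval_scores: dict[str, float] = field(default_factory=dict)


@dataclass
class HandoffProtocol:
    from_stage: str
    to_stage: str
    gate_criteria: list[str]
    responsible_role: str


def promote(artifact: ModelArtifact, protocol: HandoffProtocol,
            verified: set[str], approver_role: str) -> None:
    """Advance `artifact` one stage, but only if the matching gate is fully satisfied.

    Assumption: `verified` holds criterion strings identical to `gate_criteria`.
    """
    if artifact.status != protocol.from_stage:
        raise ValueError(f"{artifact.name} is {artifact.status}, not {protocol.from_stage}")
    if approver_role != protocol.responsible_role:
        raise PermissionError(f"{approver_role} cannot sign off this gate")
    missing = [c for c in protocol.gate_criteria if c not in verified]
    if missing:
        raise ValueError(f"unmet gate criteria: {missing}")
    artifact.status = protocol.to_stage


def audit_registry(registry: list[ModelArtifact]) -> list[str]:
    """Return the names of artifacts with no owner recorded."""
    return [a.name for a in registry if not a.owner.strip()]


gate = HandoffProtocol(
    "staging", "production",
    ["p99 latency within SLO",
     "safety evaluation passes all guardrails",
     "rollback procedure documented and tested"],
    "ML Production Engineer",
)
model = ModelArtifact("summarizer-v2", "staging", owner="ml-platform")
promote(model, gate, verified=set(gate.gate_criteria), approver_role="ML Production Engineer")
print(model.status)  # production
```

The promotion fails loudly if the stage, the approver, or any single criterion does not match. This turns "seems good enough" into an error rather than a judgment call.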
Do's and Don'ts
Now that you have worked through the implementation, the practices below separate a durable approach from a fragile one.
Do's
- Define explicit handoff protocols with named gate criteria for every transition between experimental, staging, and production stages.
- Track the research-to-production staffing ratio and rebalance it as the organization matures, typically skewing toward production over time.
- Centralize evaluation frameworks and experiment tracking in the ML team so application teams share a consistent quality bar and model decisions are reproducible.
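Once each `MLTeamMember` records its orientation, you can compute the research-to-production mix directly. This is a minimal sketch. `orientation_mix` is a hypothetical helper, the roster is illustrative, and the code assumes every member is tagged either research or production.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class MLTeamMember:
    role: str
    orientation: str  # research | production


def orientation_mix(team: list[MLTeamMember]) -> dict[str, float]:
    """Return the fraction of the team in each orientation (all zeros for an empty team)."""
    if not team:
        return {"research": 0.0, "production": 0.0}
    counts = Counter(member.orientation for member in team)
    return {o: counts[o] / len(team) for o in ("research", "production")}


team = [
    MLTeamMember("Research Scientist", "research"),
    MLTeamMember("Applied Scientist", "research"),
    MLTeamMember("ML Production Engineer", "production"),
    MLTeamMember("ML Production Engineer", "production"),
    MLTeamMember("MLOps Engineer", "production"),
]
print(orientation_mix(team))  # {'research': 0.4, 'production': 0.6}
```

If you record this mix at a regular interval, for example each planning cycle, you get the trend you need to decide when to rebalance.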
Don'ts
- Don't let models reach production through informal agreement ("it seems good enough") instead of verified gate criteria.
- Don't allow each application team to build its own evaluation methodology, since divergent benchmarks make cross-use-case quality comparison impossible.
- Don't let researcher experiments live only in personal notebooks; results that were never tracked cannot be retrieved later to answer organizational questions.
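The alternative to notebook-only experiments is a shared, append-only experiment log that anyone can query later. This sketch writes JSON Lines to a local file as a stand-in for whatever tracking service your team adopts. `log_experiment`, `query_experiments`, and the record fields are hypothetical.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def log_experiment(log_path: Path, model: str, params: dict, metrics: dict, author: str) -> None:
    """Append one experiment record as a JSON line to the shared log."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "params": params,
        "metrics": metrics,
        "author": author,
    }
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def query_experiments(log_path: Path, model: str) -> list[dict]:
    """Return every logged experiment for `model`, in the order it was logged."""
    if not log_path.exists():
        return []
    with log_path.open(encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    return [r for r in records if r["model"] == model]


# Illustrative runs from two researchers, written to a temporary log file.
log = Path(tempfile.mkdtemp()) / "experiments.jsonl"
log_experiment(log, "summarizer", {"lr": 1e-4}, {"faithfulness": 0.88}, author="alice")
log_experiment(log, "summarizer", {"lr": 3e-5}, {"faithfulness": 0.91}, author="bob")
best = max(query_experiments(log, "summarizer"), key=lambda r: r["metrics"]["faithfulness"])
print(best["params"])  # {'lr': 3e-05}
```

Because every run is written to the same log, a question such as "which learning rate worked best for the summarizer?" can be answered by anyone on the team, even after the original author has moved on.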