50 questions · Test your understanding before moving on
Question 1 of 50
Select one answer
Your evaluation team reports that their LLM benchmark consistently shows 95% accuracy, yet production users complain about poor reasoning capabilities. You discover the dataset contains 80% summarization tasks and only 5% reasoning tasks. What is the most effective structural fix to make the evaluation scores reflect true model capability?
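Before answering, it can help to see how a skewed task mix produces a high aggregate score. The sketch below uses purely illustrative per-task accuracies (the question only gives the 80%/5% composition and the 95% overall figure; the remaining 15% share and all per-task accuracies are assumptions) to contrast micro-averaged accuracy, which weights by dataset share, with macro-averaged accuracy, which weights each task type equally:

```python
# Hypothetical numbers chosen to reproduce the 95% aggregate in the question.
# Only the 80% summarization / 5% reasoning split comes from the scenario;
# the 15% "classification" slice and all per-task accuracies are illustrative.
dataset = {
    # task_type: (share_of_dataset, accuracy_on_that_task)
    "summarization": (0.80, 0.98),
    "classification": (0.15, 0.90),
    "reasoning": (0.05, 0.62),
}

# Micro (dataset-weighted) accuracy: dominated by the 80% summarization slice.
micro = sum(share * acc for share, acc in dataset.values())

# Macro accuracy: every task type counts equally, exposing weak reasoning.
macro = sum(acc for _, acc in dataset.values()) / len(dataset)

print(f"micro-averaged accuracy: {micro:.3f}")  # 0.950
print(f"macro-averaged accuracy: {macro:.3f}")  # 0.833
```

With these assumed numbers the micro average lands at exactly 0.95 even though reasoning accuracy is 0.62, which is the mismatch the question asks you to fix structurally.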