50 questions · Test your understanding before moving on
Question 1 of 50
Select one answer
Your evaluation team reports that their LLM benchmark consistently shows 95% accuracy, yet production users complain about poor reasoning capabilities. You discover the dataset contains 80% summarization tasks and only 5% reasoning tasks. What is the most effective structural fix to make the evaluation scores reflect true model capability?
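Before answering, it can help to see how a skewed task mix produces a high aggregate score. The sketch below uses purely illustrative per-task accuracies (the question only gives the 80%/5% composition and the 95% overall figure; the remaining 15% share and all per-task accuracies are assumptions) to contrast micro-averaged accuracy, which weights by dataset share, with macro-averaged accuracy, which weights each task type equally:

```python
# Hypothetical numbers chosen to reproduce the 95% aggregate in the question.
# Only the 80% summarization / 5% reasoning split comes from the scenario;
# the 15% "classification" slice and all per-task accuracies are illustrative.
dataset = {
    # task_type: (share_of_dataset, accuracy_on_that_task)
    "summarization": (0.80, 0.98),
    "classification": (0.15, 0.90),
    "reasoning": (0.05, 0.62),
}

# Micro (dataset-weighted) accuracy: dominated by the 80% summarization slice.
micro = sum(share * acc for share, acc in dataset.values())

# Macro accuracy: every task type counts equally, exposing weak reasoning.
macro = sum(acc for _, acc in dataset.values()) / len(dataset)

print(f"micro-averaged accuracy: {micro:.3f}")  # 0.950
print(f"macro-averaged accuracy: {macro:.3f}")  # 0.833
```

With these assumed numbers the micro average lands at exactly 0.95 even though reasoning accuracy is 0.62, which is the mismatch the question asks you to fix structurally.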