

Artificial intelligence is advancing rapidly, with models now scoring over 90% on benchmarks such as Massive Multitask Language Understanding (MMLU) that were once considered highly challenging. Researchers now argue that such traditional benchmarks are no longer sufficient to measure true AI capability. To address this, they introduced a far harder test called Humanity's Last Exam (HLE), designed to push AI systems to their limits.
The HLE consists of 2,500 highly complex questions created by nearly 1,000 experts from 500 organizations across 50 countries. Developed by researchers at the Center for AI Safety and Scale AI, the test aims to expose weaknesses in AI models and evaluate their real-world reasoning abilities. The questions are so difficult that even the experts disagree on the answers, at a rate of 15.4% to 18%, making it nearly impossible for any single human to answer every question correctly.
Top AI models initially scored below 10% on the HLE, but their performance is improving rapidly. A key issue the test has surfaced is “calibration error”: AI models confidently presenting incorrect answers as correct. Experts warn that this flaw could be dangerous in critical fields such as healthcare and finance. To keep pace with AI progress, the researchers have introduced a dynamic testing approach called HLE Rolling, while emphasizing that high scores indicate academic proficiency, not true general intelligence.
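To make the calibration flaw concrete, here is a minimal sketch of expected calibration error (ECE), a standard metric of the kind behind the “calibration error” described above: a model that answers with 90% confidence should be right about 90% of the time. The binning scheme and the example inputs are illustrative assumptions, not the HLE authors' exact method.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Average |accuracy - confidence| over equal-width confidence bins,
    weighted by how many predictions fall in each bin."""
    total = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Gather predictions whose confidence falls in this bin.
        in_bin = [(c, ok) for c, ok in zip(confidences, correct)
                  if lo < c <= hi or (b == 0 and c == 0.0)]
        if not in_bin:
            continue
        avg_conf = sum(c for c, _ in in_bin) / len(in_bin)
        accuracy = sum(1 for _, ok in in_bin if ok) / len(in_bin)
        ece += (len(in_bin) / total) * abs(accuracy - avg_conf)
    return ece

# A hypothetical model that is always 95% confident but right only
# half the time is badly miscalibrated:
confs = [0.95] * 10
hits = [True, False] * 5
print(round(expected_calibration_error(confs, hits), 2))  # → 0.45
```

A perfectly calibrated model would score 0.0; the 0.45 here quantifies exactly the overconfidence researchers warn about in high-stakes settings.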



















