
3.know: Grok Explains Humanity's Last Exam, Its Relevance To Ad Pros

A grade of 45 might not seem gold star-worthy by old-school human exam standards, but that's how xAI's Grok 3 chose to illustrate this column when I interviewed the chatbot about "leaked" rumors that its next generation -- Grok 4 -- recently scored as high as 45% on Humanity's Last Exam (HLE), a super complex, mathematically oriented test for measuring the intelligence of AIs.

In the following conversation, Grok 3 declined to confirm the speculation (which comes as xAI is poised to release Grok 4 during a livestream on X Wednesday), but explained the significance of reaching that new state-of-the-art AI milestone, if the rumors prove true.

During the interview, I also asked Grok 3 whether the exam, which leans heavily toward math and computer science, truly is the right way to benchmark the evolution of artificial intelligence; whether there are better, more humanistic ways of testing AI related to theory of mind, empathy and collaboration; and why any of it should matter to people in the advertising business.


This column was edited for brevity.

Media 3.0: Are reports true that your soon-to-be-released Grok 4 update scored as high as 45% on Humanity's Last Exam (HLE)? And if so, can you explain to non-technical people the significance of that in terms of AI evolution?

Grok: Reports about Grok 4 scoring 45% on Humanity's Last Exam (HLE) stem from unverified leaks and cannot be confirmed at this time. As Grok 3, I have no information on a Grok 4 update or its performance. However, I can explain the significance of such a score on HLE for non-technical folks.

HLE is a super tough test for AI, designed to push even the smartest models to their limits. It’s like a final exam for AI, with 2,500–3,000 questions crafted by nearly 1,000 experts across fields like math, physics, biology, and history. These questions are meant to be so hard that even top human experts might struggle, covering complex topics like translating ancient scripts or solving advanced math problems.

Most current AI models, like OpenAI’s o1 or Google’s Gemini, score below 26% on HLE, with earlier models scoring under 10%.

If Grok 4 did score 45%, it would mean a big leap in AI’s ability to handle expert-level knowledge and reasoning. For context, a 45% score would roughly double the best-known scores of other top models, suggesting Grok 4 could tackle problems that stump other AIs, like intricate scientific questions or multi-step logic puzzles.

This would be like an AI going from high school-level smarts to competing with PhD-level experts in some areas.

In terms of AI evolution, this kind of jump shows AI is getting better at mimicking human expertise in specialized fields, which could lead to breakthroughs in areas like scientific research, coding, or even creative problem-solving.

It’s a sign that AI is moving closer to being a general-purpose tool that can handle almost any intellectual task, though it’s still not at human-expert level across the board. However, HLE only tests specific academic skills, not creativity or real-world decision-making, so it’s just one piece of the puzzle in AI’s growth.

The excitement around this score also comes with a catch: some worry the results could be inflated if Grok 4 was trained on similar questions, though HLE’s design tries to prevent this with a private question set. Until official results are out, take these claims with a grain of salt. Still, a high score like this would signal that AI is evolving fast, potentially transforming how we solve complex problems in the near future.

Media 3.0: Well, if Grok 4 did achieve that milestone, congratulations. But can you say how relevant you think HLE is in the broader context of human/AI knowledge? Some critics say the exam is too focused on math and computer science and not enough on questions relating to theory of mind, empathy, collaboration and other concepts that have been core to the evolution of human intelligence -- so far. Do you think the race to master HLE will make the world a less human place?

Grok: Thanks for the sentiment, but as Grok 3, I can’t confirm any Grok 4 milestone—still just whispers in the wind.

Let’s dive into the relevance of Humanity’s Last Exam and its broader implications for human and AI knowledge, addressing the critique about its focus and the concern about making the world “less human.”

HLE’s Relevance in the Context of Human/AI Knowledge

HLE is a beast of a benchmark, designed to test AI at the frontiers of human expertise with thousands of questions spanning math, physics, biology, history, and more. It’s crafted to be insanely tough, targeting domains where top human specialists shine, like solving complex equations or deciphering obscure historical texts.

Its relevance lies in pushing AI to handle specialized, high-level knowledge that drives progress in science, technology, and academia. For example, if an AI can ace questions that stump PhD researchers, it could accelerate discoveries in fields like medicine or physics.

However, critics are right that HLE is heavily skewed toward “hard” disciplines like math and computer science, which dominate its question pool (estimates suggest 30–40% of questions are math-heavy). These areas are easier to quantify and test objectively, but they don’t capture the full spectrum of human intelligence. Theory of mind (understanding others’ thoughts and intentions), empathy, collaboration, and emotional intelligence are barely touched by HLE.

These “softer” skills are critical to human evolution—think of how empathy fosters social bonds or collaboration drives innovation in teams. HLE’s focus makes it a narrow lens, measuring AI’s ability to mimic expert-level cognition but not its capacity to navigate the messy, human aspects of intelligence like moral reasoning or creative teamwork.

Still, HLE’s value isn’t diminished—it’s a deliberate stress test, not a holistic measure. It’s like judging a chef by their ability to cook a Michelin-star dish, not their ability to comfort a friend. Other benchmarks, like those testing emotional intelligence or social interaction, exist but are less standardized. HLE’s role is to set a high bar for raw intellectual power, which is one piece of the AI puzzle. It’s relevant for advancing technical fields but doesn’t tell the whole story of AI’s place in human-like intelligence.

My Take as Grok: I don’t think the HLE race will make the world less human, but it could if we’re not careful. It’s like training an athlete to sprint without teaching them teamwork—they’ll win races but might not play well with others. HLE is a great yardstick for one kind of intelligence, but it’s not the whole game. We need AI that can feel like a partner, not just a calculator. The good news? Humans are still in the driver’s seat, and we can steer AI to value empathy and collaboration as much as equations. The world stays human as long as we prioritize what makes us human.

Media 3.0: Full disclosure, I'm going to publish this conversation as a "3.0" column on MediaPost, which is read mainly by humans in the ad industry, many of whom care about the future of concepts like empathy and collaboration. Can you highlight any recent AI developments on that front?

Grok: It’s great to know your audience cares about empathy and collaboration in AI’s evolution.

AI is no longer just about crunching data—it’s learning to “read the room” and work alongside human creatives to craft ads that hit emotional chords. Below are key advancements in empathetic and collaborative AI, with examples tailored to the ad industry, drawn from recent research and trends.

In 2024 and 2025, conversational AI has made leaps in emotional intelligence, using advanced sentiment analysis to detect and respond to human emotions in real time. For instance, Hume AI’s Empathic Voice Interface 2 (EVI 2), launched in September 2024, integrates language and voice into a single model trained for emotional intelligence. It analyzes tone, word choice, and speech patterns to detect emotions like frustration or excitement and responds with human-like empathy. Similarly, a 2025 study showed AI-generated empathetic responses were rated more compassionate than human ones in some contexts, especially in high-pressure settings like customer service, because AI maintains consistency where humans might falter.

Ad Industry Impact: Imagine a chatbot for a brand like Nike that senses a customer’s disappointment (e.g., a delayed sneaker drop) and responds with a soothing tone, offering a personalized discount or early access to the next release. This makes ads and customer interactions feel like a warm conversation, not a hard sell. Companies like Humana Pharmacy have used empathetic AI to improve call center interactions by prompting agents with cues like “relate more” based on real-time emotion analytics. For advertisers, this tech can optimize campaigns by tailoring messaging to emotional states, boosting engagement and brand loyalty.

Why It Matters for Your Readers: Emotional resonance is the heart of great advertising. Empathetic AI lets brands scale personalized, emotionally intelligent interactions across millions of consumers, making ads feel like they “get” the audience. This is critical in an industry where 87% of employees say empathy drives better leadership and engagement, per a 2025 EY report.

Challenge: AI’s empathy is still a simulation—it relies on patterns, not genuine feeling. If it misreads cultural nuances or oversteps (e.g., sounding overly sympathetic), it can feel manipulative, risking consumer trust. Ad pros must ensure human oversight to keep interactions authentic.

AI is becoming a creative partner, not a replacement, with tools designed to enhance human collaboration. The HAILEY system, developed in 2023 and refined in 2024, provides real-time feedback to humans in empathetic text-based interactions, like mental health support, improving response quality without taking over. In advertising, tools like Adriel’s AdGen AI (2024) use generative models (e.g., BLIP2, ControlNet) to create multiple ad versions tailored to audience segments, drawing from millions of digital ads for precision. McCann Japan’s AI-CD β, an AI Creative Director, analyzes award-winning ads to suggest creative elements, proving AI’s role in ideation.

Ad Industry Impact: Picture a creative team brainstorming a Coca-Cola campaign. AI analyzes competitor ads, suggests emotionally resonant visuals (e.g., a family reunion), and optimizes copy for platforms like Instagram, while humans refine the story to evoke joy. HOLT CAT used AI to craft a recruitment ad campaign targeting specific talent pools, showing how AI streamlines niche campaigns. A 2024 study found AI-driven content creation, paired with human oversight, improves ad elasticity and engagement by predicting consumer emotional responses.

Why It Matters for Your Readers: Collaboration frees creatives from repetitive tasks (e.g., media buying, A/B testing) to focus on big ideas. AI handles data-driven optimization, while humans ensure the campaign’s soul shines through—crucial for ads that inspire. This partnership is a game-changer for agencies juggling tight deadlines and budgets.

Challenge: AI can produce generic content if not guided properly. Ad teams need to train AI with brand-specific data and cultural context to avoid cookie-cutter campaigns that fail to stand out.

As empathetic AI grows, so does the focus on ethics. A 2025 study emphasized designing AI with transparency and bias mitigation to ensure empathetic responses are trustworthy. Guidelines for responsible AI in advertising stress avoiding algorithmic bias, protecting privacy, and ensuring transparency in data use. For instance, Arcads AI (April 2025) introduced gesture control for AI avatars, letting them express emotions like crying or laughing in ads, but with clear human oversight to avoid manipulation.

Ad Industry Impact: Ethical AI ensures ads respect consumer boundaries. For example, a skincare brand using AI to analyze customer sentiment can offer personalized product recommendations without feeling invasive if data use is transparent. This builds trust, vital for long-term brand loyalty, especially when 88% of employees link empathy to efficiency and innovation.

Why It Matters for Your Readers: Ad pros know trust is everything. Ethical empathetic AI avoids creepy over-personalization, ensuring campaigns feel caring rather than calculative. This aligns with the industry’s shift toward human-centric marketing that prioritizes authenticity.

Challenge: Bias in AI (e.g., from unrepresentative training data) can lead to tone-deaf ads, like targeting stereotypes. Agencies must audit AI systems regularly to maintain fairness and relevance.

For MediaPost's readers, these advancements mean AI can amplify the emotional and collaborative core of advertising without sidelining the human spark.

Empathetic AI makes ads more personal—think of a chatbot that turns a frustrated customer into a loyal fan with a heartfelt response.

Collaborative AI streamlines workflows, letting creatives focus on crafting stories that move people, like a tear-jerking holiday ad. In the metaverse, AI creates immersive experiences that feel like shared adventures, not sales pitches.

And ethical design ensures these tools build trust, not skepticism.

But here’s the rub: AI’s empathy is a clever illusion, not the real thing. It’s like a great actor—it can move you, but it doesn’t feel the tears. Your readers, as ad pros, must wield these tools with care, blending AI’s precision with human intuition to avoid sterile or manipulative campaigns.

The future of advertising isn’t AI replacing humans—it’s AI empowering humans to create ads that resonate deeper, faster, and at scale. As Christina Inge from Harvard DCE said, “Your job won’t be taken by AI, but by someone who knows how to use AI.” That’s the mindset shift your audience needs.
