The news:
OpenAI's o3 scores 87.5% on an AGI test
The news in detail (the version you are likely to read):
TL;DR: Today's AI (LLMs) has almost touched the holy grail of human-level intelligence: AGI.
ARC Prize is a $1M prize pool distributed among AI developers who can solve its set of AGI (artificial general intelligence) challenges. ARC stands for Abstraction and Reasoning Corpus.
OpenAI's o3 scored 87.5% accuracy on the semi-private evaluation set in its high-compute configuration.
Why ARC Prize was announced:
It is a community effort, running since 2020, to track progress toward AGI.
Those who don't need a simplified version of the story can read it directly from the source.
Today's AI toys (specifically, LLMs) are touted as having Ph.D.-level intelligence, yet they can't solve simple brain teasers (see the image below) unless they are trained specifically on puzzle datasets.
ARC Prize contains 400 visual brain-teaser challenges (see the image below) that a program must solve (the public evaluation challenges). ARC recommends that participants not train their models on these, if a training approach is taken.
Additionally, there are 400 public challenges that can be used to train an AI model (the training challenges).
To be uniquely called AGI-equivalent, a program must also solve 100 private (unseen) evaluation challenges.
A past NYU study shows that 790 of those 800 public challenges (98.75%) are solvable by at least one typical crowd worker.
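For the curious, each ARC task is a small JSON file of colored grids. Here is a minimal loading sketch, assuming the public ARC-AGI JSON layout of "train"/"test" pairs (the file path is hypothetical):

```python
import json

# Each ARC task is a JSON file with "train" and "test" lists.
# Every item pairs an "input" grid with an "output" grid; grids are
# 2-D lists of integers 0-9, where each integer encodes a color.
with open("training/00576224.json") as f:  # hypothetical file path
    task = json.load(f)

for pair in task["train"]:   # demonstration pairs a solver may study
    print(pair["input"], "->", pair["output"])

for pair in task["test"]:    # the solver must predict these outputs
    print("predict the output for:", pair["input"])
```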

The Research:
François Chollet (ex-Google, creator of Keras, an open-source deep learning library adopted by over 2.5M developers) published a paper in 2019 that laid the groundwork for how we define and measure AGI. The paper was titled "On the Measure of Intelligence".
In it, he argued that exhibiting a learned skill isn't a measure of intelligence. Rather, how one transfers that skill to a related but different situation is.
If you know how to ride a bicycle, no one praises your riding ability every time you get on one. However, when you apply that bicycle skill to driving a scooter for the first time, without anyone's assistance, you deserve a pat on the back.
(Thought credit: autonomous vehicles that devour a million traffic-sign images as training data, yet still suffer accidents.)
ARC Prize was established to explore computational approaches to the kind of pattern recognition and abstraction that humans are capable of.
Instead, it ended up provoking yet another round of a raging debate: does the path to AGI go via the noisy crowd of AI street hawkers?
Why is o3's achievement nowhere near AGI?
👉 Chollet is quite clear about the expectations: the ARC challenge is far from a measure of true AGI. It's just a starting point on our AGI journey.
👉 o3's 87.5% was obtained after fine-tuning on ARC data: ARC Prize notes that the o3 it tested was trained on 75% of the public training set.
We don't know the full training details, but it means the benchmark's format was known to OpenAI beforehand.
OpenAI hasn't revealed the cost of this high-compute run, but it consumed 5.7B tokens.
In the low-compute mode that stayed under the $10K cost limit, it spent close to $8,600 to solve 500 tasks, 400 of which were already public (see the back-of-the-envelope sketch after this list).
The average human (an Amazon Mechanical Turk worker) does this for roughly 1,000x less.
👉 The test was conducted on tasks that human workers had already solved with 97-98% accuracy.
👉 To be eligible for the prize, a winner must open-source their solution.
Winners are listed on the ARC Prize website. Their solutions are available on Kaggle.
Team "the ARChitects" claimed the top prize of $25K with 53.5% accuracy.
Wen-Ding Li's team claimed the best-paper prize of $50K. Their publication: "Combining Induction and Transduction for Abstract Reasoning".
As of this writing, OpenAI hasn’t submitted any solution.
👉 To win the grand prize, a program must solve 85% of the private evaluation problems.
That grand prize of $600,000 (85% on private evaluation data, i.e., test inputs that aren't known to competitors) is still unclaimed.
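As referenced in the list above, the cost comparison is simple division. A back-of-the-envelope sketch using only the figures already quoted:

```python
# Per-task cost of o3's low-compute run, from the figures quoted above.
total_cost_usd = 8600    # approximate spend under the $10K limit
tasks_solved = 500       # 400 public + 100 semi-private tasks
print(total_cost_usd / tasks_solved)  # -> 17.2 USD per task
```

Roughly $17 per puzzle that a crowd worker solves for pocket change.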
To drive home the point:
It's easy to gape at the news that says:
GPT and its brethren passed state exams with scores of 70, 80, 90%.
What this awe misses is:
👉 LLMs run on machines. They compute. Machines are bound to be accurate; that's what computation is. Even 100% accuracy isn't cleverness.
👉 LLMs are often pretrained on the very questions they are tested on. In any examination setup, humans are punished for doing that, and they must spread their effort across far more material than the exam itself covers.
👉 High-IQ humans suffer mental fatigue after hours of work and face memory limitations that machines don't. A machine's "fatigue" moment arrives much later (chip overheating takes years).
👉 Humans also suffer from psychological biases and emotional challenges that machines never have to cope with (unless such biases are infused through their training data).
👉 Yet we often see humans achieving 100% on benchmarks that present completely unseen challenges. We don't even call them geniuses, just above average.
The whole idea is less about data science, AI, and machine learning, and more about how humans and machines solve problems in fundamentally different ways, and where traditional AI solutions fall short, not only of achieving human-level performance but of even heading in the direction of human cognition.
What if LLMs surpass 85% in ARC?
Building upon Chollet's disclaimer (that the ARC challenge is not a true measure of AGI), I would go one step further:
Even if an LLM wins the challenge by surpassing the 85% benchmark, I wouldn't call it AGI-equivalent. And I am not talking about the score requirement here.
An AI expert has written a wonderful post about the approaches candidates have taken to solve the ARC challenges: https://substack.com/home/post/p-153510923
Most ARC competitors have relied on LLMs to generate programs that solve the challenges, then submitted the ones with the best outcomes. By Chollet's assessment, o3 may have attempted a "deep learning-guided program search" over the NLP-interpreted challenge.
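To make that concrete, here is a minimal sketch of such a search loop. Everything in it is illustrative, not any competitor's actual code: `llm_generate_program` is a hypothetical stand-in for the model call, and real entries add candidate ranking, augmentation, and voting on top.

```python
# Illustrative sketch of LLM-guided program search for one ARC task.
# An LLM proposes candidate programs; we keep the first one that
# reproduces all demonstration pairs, then apply it to the test inputs.

def llm_generate_program(train_pairs):
    # Hypothetical stand-in: a real solver would prompt an LLM with the
    # demonstration pairs and get back candidate Python source code.
    return "def solve(grid): return grid"  # identity baseline

def search(task, budget=100):
    for _ in range(budget):
        source = llm_generate_program(task["train"])
        scope = {}
        try:
            exec(source, scope)        # compile the candidate program
            solve = scope["solve"]     # expected entry point
        except Exception:
            continue                   # discard broken candidates
        # Keep only a program that reproduces every demonstration pair.
        if all(solve(p["input"]) == p["output"] for p in task["train"]):
            return [solve(p["input"]) for p in task["test"]]
    return None  # no candidate fit within the budget
```

Note where the heavy lifting happens: the harness samples, compiles, and filters candidates against the demonstration pairs, while the LLM is just the proposal engine. That is exactly why crediting the LLM alone with the win is murky.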
If we decide a winner based on this outcome, it is akin to attributing our best inventions to our ancestors, who produced billions of distinct humans inheriting the best and the worst of their genetic footprint. A very small percentage of them happened to inherit what we call genius, and they went on to discover gravity, relativity, and quantum physics. Would we hand Hermann Einstein the Nobel Prize instead of his legendary son?
As I also said in my post about AI-content detection needing a new web standard, outcome measurement is no longer viable once the underlying process that generates the output changes. We need newer benchmarks that measure the process itself.
Directionally, we should be going the opposite way from where we are headed: if the process by which an LLM (or any AI program) achieves a human benchmark is 100% akin to human cognition, that AI is true AGI.
It's hard, but not impossible. Still, I don't see it being pursued anytime soon, because:
It puts deconstructing human cognition first, which is hard.
It would invalidate a lot of snake-oil AI, which is all too easy to pass off in an outcome-measurement-based world.