OpenAI Unveils o3 and o4-mini Models, Claims o3 Can ‘Generate Novel Hypotheses’

In December 2024, OpenAI announced o3, its most advanced reasoning AI model, promising a release after thorough safety testing. Now, after four months, the full o3 model has officially launched, alongside the next-generation o4-mini and o4-mini-high reasoning models.

In that time, OpenAI has further enhanced o3, calling it its “most powerful reasoning model” yet. Both o3 and o4-mini can use the agentic tools available within ChatGPT, such as web search and Python, and can now analyze images. They are trained to select the best tools for the task at hand.

OpenAI states that o3 sets a new standard in coding, math, science, and visual tasks like analyzing images, charts, and graphics. Early testers report that o3 can “generate and critically evaluate novel hypotheses, especially in biology, math, and engineering.”

The new o4-mini is a smaller, faster, and more cost-efficient model that excels in math, coding, and visual tasks. It even scored 99.5% on AIME 2025 with Python interpreter access.

Both o3 and o4-mini have nearly maxed out the AIME 2024 and 2025 benchmarks. On GPQA Diamond, o3 scores 83.3% while o4-mini scores 81.4%. On Humanity’s Last Exam, o3 scores 20.32% without tools and 24.9% with tools. On SWE-bench Verified, o3 achieves 69.1%, outperforming Google’s Gemini 2.5 Pro at 63.8%.

On multimodal benchmarks, both o3 and o4-mini perform competitively, achieving high accuracy in MMMU, MathVista, and CharXiv-Reasoning.

Additionally, OpenAI has launched Codex CLI, a new command-line agentic tool similar to Anthropic’s Claude Code. It runs from your terminal and leverages multimodal reasoning powered by o3 and o4-mini.

OpenAI o3 and o4-mini Availability

Starting today, o3 and o4-mini are rolling out to ChatGPT Plus, Pro, and Team users, replacing the older o1, o3-mini, and o3-mini-high models. ChatGPT Enterprise and Edu users will get access within a week. Good news for free-tier users: o4-mini will be available via the ‘Think’ button.

OpenAI also announced that o3-pro with full tool support is coming in a few weeks, while Pro users can keep using the o1-pro model in the meantime.

OpenAI o3 is a Powerful Reasoning Model

In case you missed it, OpenAI’s o3 reasoning model made headlines in 2024 by becoming the first to crack the ARC-AGI benchmark, achieving an impressive 87.5% score on the ARC-AGI Semi-Private Evaluation set using a high-compute setup. François Chollet, the ARC-AGI creator, highlighted this achievement in a blog post:

“This is not merely incremental improvement, but a genuine breakthrough, marking a qualitative shift in AI capabilities compared to the prior limitations of LLMs. o3 is a system capable of adapting to tasks it has never encountered before, arguably approaching human-level performance in the ARC-AGI domain.”

However, it was also disclosed that o3 was trained on 75% of the ARC-AGI Public Training set, sparking debate over whether its performance stems from true generalized intelligence or benchmark-specific tuning.

Still, a recent report from The Information highlights o3’s ability to integrate knowledge across multiple disciplines, an ability it likens to that of Nikola Tesla. According to the report, the model can generate novel scientific ideas and design experiments in fields like nuclear fusion and pathogen detection. OpenAI reportedly considers its capabilities advanced enough to warrant a $20,000-per-month pricing tier, dubbing it a “PhD-level AI.”
