The Multi-Model Revolution: How Sakana AI Beat Claude 5

A Japanese AI System Reportedly Beat Claude 5 On Certain Benchmarks

June 23, 2026
Devendra Prasad

The global race for artificial intelligence supremacy is changing fast. For years, it looked like a high-stakes duel between Silicon Valley giants and heavily backed American startups. Names like OpenAI, Google, and Anthropic routinely dominate the headlines. Each new model generation promises more parameters and larger compute clusters, and the focus has stayed on raw computing power.

However, a massive disruption has emerged from Tokyo. The spotlight has shifted to a lean, hyper-innovative player called Sakana AI.

In a major industry shakeup, Sakana AI has launched a new AI system called Fugu. Rather than training a single, trillion-parameter model, Sakana built Fugu around a multi-model orchestration architecture. Benchmark data released by the company suggests that this approach works: Fugu reportedly outperformed Anthropic’s top-tier, tightly restricted Claude 5 models on key engineering, coding, and reasoning benchmarks.

Here is an in-depth breakdown of how this Japanese startup dethroned some of the most powerful models in existence. We will also explore the geopolitical drama surrounding Claude 5 and what this means for the future of decentralized AI.

Redefining Architecture: What is the Fugu System?

To understand this milestone, we must look at how the Fugu system works. Most mainstream AI products rely on a monolithic model. This means a single, massive neural network handles every query. It writes poems, debugs software, and solves complex physics problems using the exact same system.

Sakana AI took an entirely different route. Fugu does not rely on one single model. Instead, it functions as an intelligent, high-speed orchestrator. It coordinates multiple specialized AI models through a single API.

When a user submits a complex task, Fugu analyzes the request and breaks it down into smaller components. It then routes each component to the model best suited to solve it. Finally, it combines the partial results into a single answer.
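The decompose, route, and synthesize steps described above can be sketched in a few lines of Python. This is a minimal illustration only, not Sakana's implementation: the specialist functions, the keyword table, and the semicolon-based planner are all hypothetical stand-ins, and Fugu's actual routing logic is not described in this article.

```python
from typing import Callable, Dict, List, Tuple


# Stand-ins for specialized models. A real system would call model endpoints here.
def code_model(prompt: str) -> str:
    return f"[code] {prompt}"


def math_model(prompt: str) -> str:
    return f"[math] {prompt}"


def general_model(prompt: str) -> str:
    return f"[general] {prompt}"


# Hypothetical registry: specialist name -> (trigger keywords, model function).
SPECIALISTS: Dict[str, Tuple[Tuple[str, ...], Callable[[str], str]]] = {
    "code": (("implement", "bug", "function"), code_model),
    "math": (("solve", "prove", "integral"), math_model),
}


def decompose(task: str) -> List[str]:
    """Break a task into components. Splitting on ';' is a toy planner."""
    return [part.strip() for part in task.split(";") if part.strip()]


def route(subtask: str) -> Callable[[str], str]:
    """Pick the first specialist whose keywords appear in the subtask."""
    lowered = subtask.lower()
    for keywords, model in SPECIALISTS.values():
        if any(word in lowered for word in keywords):
            return model
    return general_model  # fallback when no specialist matches


def orchestrate(task: str) -> str:
    """Decompose the task, route each component, then join the partial answers."""
    partial_answers = [route(sub)(sub) for sub in decompose(task)]
    return "\n".join(partial_answers)


if __name__ == "__main__":
    print(orchestrate("Implement a stack; Solve 2x = 6; Say hello"))
```

In this toy version the routing decision is a keyword lookup. A production orchestrator would presumably use a learned router or a planning model, but the control flow (split, dispatch, merge) has the same shape.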

Sakana has launched two primary commercial versions of this system:

Fugu: This agile version optimizes daily, iterative workflows. It excels at software development, conversational interfaces, and everyday productivity tasks.
Fugu Ultra: This high-performance iteration handles intensive workloads. It thrives on autonomous scientific research, academic paper reproduction, deep cybersecurity analysis, and patent investigations.

Breaking Down the Benchmarks: Fugu vs. Claude 5

The claim that a new system surpassed Anthropic’s flagship models is bold. However, Sakana backed up the launch with data from several open-source and specialized benchmarks. According to those results, Fugu pulls ahead of both Claude Fable 5 and the Claude Mythos Preview, the models that represent Anthropic’s most capable technology.

1. Advanced Software Engineering: LiveCodeBench

LiveCodeBench is a highly regarded, open-source benchmarking platform. It tests an AI’s authentic coding performance. Models can accidentally memorize static datasets during training. To avoid this, LiveCodeBench continuously refreshes its database with brand-new, real-world software problems. Fugu’s multi-model architecture showed superior logical structuring and syntax accuracy on these fresh tasks.
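One way a continuously refreshed problem pool guards against memorization is to score a model only on problems published after its training cutoff. The sketch below illustrates that idea under stated assumptions: the `Problem` fields, the sample dates, and the pass/fail results are invented for illustration and are not taken from LiveCodeBench or from Sakana's data.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List


@dataclass(frozen=True)
class Problem:
    problem_id: str
    released: date  # date the problem was first published


def fresh_problems(pool: List[Problem], training_cutoff: date) -> List[Problem]:
    """Keep only problems published strictly after the model's training cutoff."""
    return [p for p in pool if p.released > training_cutoff]


def pass_rate(problems: List[Problem], passed: Dict[str, bool]) -> float:
    """Fraction of the given problems whose generated solution passed all tests."""
    if not problems:
        return 0.0
    solved = sum(1 for p in problems if passed.get(p.problem_id, False))
    return solved / len(problems)


# Illustrative data only.
POOL = [
    Problem("p1", date(2025, 3, 1)),
    Problem("p2", date(2025, 9, 15)),
    Problem("p3", date(2026, 1, 10)),
]
RESULTS = {"p1": True, "p2": True, "p3": False}

if __name__ == "__main__":
    fresh = fresh_problems(POOL, training_cutoff=date(2025, 6, 1))
    print([p.problem_id for p in fresh])
    print(pass_rate(POOL, RESULTS), pass_rate(fresh, RESULTS))
```

With this sample data, the score on the full pool is higher than the score on the post-cutoff subset, which is the kind of gap a date filter is meant to expose.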

2. Graduate-Level Science and Logic: GPQA-Diamond

Experts consider the GPQA-D (Graduate-Level Google-Proof Q&A Diamond) benchmark one of the toughest academic tests for AI. It consists of 198 ultra-complex, multiple-choice questions in biology, physics, and chemistry. Domain experts write these questions carefully. They structure them so that a standard search engine cannot easily find the answer. Fugu’s coordinated approach managed to edge past Anthropic’s peak reasoning architecture here.
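Because the set contains only 198 questions, each question is worth roughly half a percentage point, so a narrow lead can come down to a handful of answers. The short sketch below shows the arithmetic. It assumes four answer options per question, and the helper names are illustrative rather than part of any official evaluation harness.

```python
from typing import Sequence

NUM_QUESTIONS = 198  # size of the Diamond set, as stated above
NUM_CHOICES = 4      # assumption: four answer options per question


def accuracy(predicted: Sequence[str], correct: Sequence[str]) -> float:
    """Share of questions where the predicted option matches the answer key."""
    if len(predicted) != len(correct) or not correct:
        raise ValueError("predicted and correct must be non-empty and equal in length")
    hits = sum(p == c for p, c in zip(predicted, correct))
    return hits / len(correct)


def points_per_question(num_questions: int = NUM_QUESTIONS) -> float:
    """Percentage points gained by answering one additional question correctly."""
    return 100.0 / num_questions


def chance_baseline(num_choices: int = NUM_CHOICES) -> float:
    """Expected accuracy of uniform random guessing."""
    return 1.0 / num_choices


if __name__ == "__main__":
    print(accuracy(["A", "B", "C", "D"], ["A", "B", "D", "D"]))
    print(round(points_per_question(), 3))
    print(chance_baseline())
```

Under these assumptions, a two-point gap between systems corresponds to about four questions, which is worth keeping in mind when reading leaderboard differences on a test this small.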

3. Outperforming the Global Field

Fugu’s success goes beyond defeating Anthropic. Sakana AI released data showing that the Fugu ecosystem consistently outperformed a laundry list of the industry’s heaviest hitters. This includes OpenAI’s GPT-5.5, Google’s Gemini 3.1 Pro, and the widely used Claude Opus 4.8.

Fugu achieved dominant scores across highly disparate fields, proving its immense versatility. The system demonstrated clear superiority in automated scientific research and complex mechanical design. It also excelled at financial time-series prediction, Japanese handwriting analysis, one-shot chess, and rapid Rubik’s Cube solving.

The Geopolitical Context: The Restricted Power of Claude 5

To truly appreciate Fugu’s benchmark victories, we must understand the current state of Anthropic’s Claude 5 line. Many everyday users have not yet interacted with Claude Fable 5 or Mythos 5. The reason is a series of unprecedented national security interventions by the United States government.

The Rise and Fall of Mythos

Anthropic originally previewed its underlying foundation model, Mythos, earlier this year. However, the company immediately withheld it from mass public release. Internal safety audits revealed the model was simply too powerful.

During closed-door evaluations, Mythos demonstrated terrifying capabilities. It identified critical zero-day vulnerabilities in every major operating system and web browser it tested. Many of these security flaws had remained completely undetected by human engineers for decades.

Fear spread immediately among officials. If bad actors or state-sponsored hacking groups gained access to Mythos, they could cause chaos. They could weaponize it to dismantle critical infrastructure, compromise global banking networks, or synthesize advanced biological weapons.

Project Glasswing and the Three-Day Rollback

In response to these findings, Anthropic initiated Project Glasswing. This hyper-controlled, defensive cybersecurity initiative restricted access to the raw Mythos model. Only 50 highly vetted organizations globally received access. Tech infrastructure giants like Google, Apple, Amazon, Microsoft, and CrowdStrike used it to patch vulnerabilities before hackers could exploit them.

Later, Anthropic attempted a wider commercial deployment of its adjusted flagship model, Claude Fable 5. The rollout lasted a mere three days. Citing immediate national security risks, the U.S. government stepped in and requested that Anthropic revoke all access to the model for foreign nationals and overseas entities.

To maintain public safety, Anthropic engineered severe guardrails into the commercial version of Fable 5. The system actively monitors user prompts for high-risk areas like advanced biochemical engineering or network penetration. If it detects these topics, the system instantly triggers an internal rollback. It automatically degrades its own capabilities to the older, less volatile Claude Opus 4.8 framework.
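The monitor-and-downgrade behavior described above amounts to a guarded model selector. The sketch below is purely illustrative: the model labels, the trigger phrases, and the substring check are hypothetical, and Anthropic's actual monitoring logic is not public. A real system would presumably rely on a trained classifier rather than keyword matching.

```python
from typing import Tuple

# Hypothetical labels and trigger phrases, chosen only to mirror the text above.
PRIMARY_MODEL = "fable-5"
FALLBACK_MODEL = "opus-4.8"
HIGH_RISK_PHRASES: Tuple[str, ...] = (
    "biochemical engineering",
    "network penetration",
)


def is_high_risk(prompt: str) -> bool:
    """Flag a prompt if it mentions any listed high-risk topic (case-insensitive)."""
    lowered = prompt.lower()
    return any(phrase in lowered for phrase in HIGH_RISK_PHRASES)


def select_model(prompt: str) -> str:
    """Serve flagged prompts with the older fallback model, others with the primary."""
    return FALLBACK_MODEL if is_high_risk(prompt) else PRIMARY_MODEL


if __name__ == "__main__":
    print(select_model("Plan a network penetration test"))
    print(select_model("Write a haiku about autumn"))
```

The design choice worth noting is that the downgrade happens per request: a flagged prompt is answered by the less capable model instead of being refused outright.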

Because Anthropic had to constrain its models with aggressive safety measures, Sakana AI found an opening. Its unburdened, highly coordinated Fugu system claimed the crown on open benchmarks.

The Masterminds Behind Sakana AI

Fugu’s sudden leap to the front of the AI field may surprise some people. However, a look at the company’s lineage reveals an extraordinary pedigree of AI pioneers.

Based in Tokyo and founded in 2023, Sakana AI features founders who fundamentally shaped the modern AI landscape. Llion Jones co-founded the company. The industry recognizes him as one of the elite co-authors of Google’s landmark 2017 research paper, “Attention Is All You Need”. This historic paper introduced the Transformer architecture. Today, the Transformer serves as the literal foundation for almost every modern LLM, including GPT, Claude, and Gemini.

Jones partnered with David Ha to build the startup. Ha previously served as the widely respected head of research at Stability AI. Together, they focus on nature-inspired, evolutionary, and collaborative AI methods. They intentionally avoid brute-force data crunching.

A New Paradigm: Efficiency Over Size

The success of Sakana AI’s Fugu system signals a vital turning point for the technology sector. For years, the industry consensus followed a simple rule: whoever has the largest data center and the most electricity wins. Sakana has effectively broken that narrative.

Fugu exposes a single, unified API that orchestrates multiple specialized models, which tackle complex tasks collaboratively. In doing so, it makes the case that architectural efficiency, routing logic, and system synergy can outperform monolithic giants. While geopolitical tensions continue to restrict the distribution of American AI models, international trailblazers like Sakana AI are advancing a different idea: the future of intelligence is not just bigger, it is smarter.
