OpenAI's Codex-Spark and the Cerebras Bet: Why 1,000 Tokens Per Second Changes Everything
OpenAI's first custom chip partnership delivers real-time AI coding at 1,000+ tokens per second. Here's why the Cerebras deal matters more than the speed number.
From The Bit Baker newsletter — February 14, 2026
On February 12, OpenAI did something it had never done: ship a model built from scratch for someone else's silicon. GPT-5.3-Codex-Spark is a distilled, compact variant of GPT-5.3-Codex, purpose-built to run on Cerebras' Wafer-Scale Engine 3. The headline? Over 1,000 tokens per second -- roughly 15x faster than standard Codex. But the partnership behind it might matter more than the speed.
This isn't a general-purpose model. Codex-Spark is a specialist, tuned for lightweight coding work: targeted edits, regex generation, syntax cleanup, rapid-fire iterations. In a demo, it built a working snake game in nine seconds. Standard Codex took 43. The tradeoff is straightforward -- reasoning depth for velocity -- using a distilled architecture compact enough to fit within the WSE-3's on-chip SRAM.
ChatGPT Pro subscribers get the research preview through the Codex app, CLI, and IDE extensions. It handles a 128k context window (text only), delivers 50% faster time-to-first-token, and costs 30% less per token than the full-sized model.
Why It Matters
For years, OpenAI's compute story has been an Nvidia story. A $100 billion GPU deal. An entire infrastructure built on CUDA. That kind of dependence brings pricing risk, supply bottlenecks, and a single point of failure in the hardware stack.
The Cerebras deal says OpenAI is done putting all its chips -- pun intended -- in one basket. The move tracks with a broader diversification push: a six-gigawatt AMD MI450 agreement signed in October 2025, Google Cloud TPU rentals starting mid-2025, and a custom Broadcom chip targeting mass production this year. OpenAI is assembling a multi-vendor stack, and Cerebras just became a load-bearing pillar.
For Cerebras, landing OpenAI is the kind of customer win that changes a company's trajectory. The company has been pitching wafer-scale technology for years; this deal finally puts it on the map. The WSE-3 -- one enormous chip with 4 trillion transistors and massive on-chip SRAM -- sidesteps the memory bottleneck that throttles GPU-based inference. That's the physics behind 1,000 tokens per second.
What's Under the Hood
The speed gap comes down to architecture. Standard GPU inference shuffles data between compute cores and external HBM (high-bandwidth memory) -- every transfer adds latency. The WSE-3 keeps everything in SRAM, which is orders of magnitude faster than HBM. The catch? SRAM is expensive per bit, so the model needs to be small enough to fit. Hence the distillation.
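The bandwidth argument can be sketched with back-of-envelope arithmetic: during decode, a bandwidth-bound model must stream its full weights once per token, so memory bandwidth divided by model size gives a throughput ceiling. All numbers below are illustrative assumptions, not published specs for the WSE-3 or Codex-Spark:

```python
# Back-of-envelope: decode throughput for a memory-bandwidth-bound model.
# Each decoded token streams the full weight set once, so:
#   tokens/sec ceiling ~= effective bandwidth / bytes per token pass.
# Figures below are illustrative assumptions, not real chip or model specs.

def tokens_per_sec(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on single-stream decode speed, bandwidth-bound regime."""
    return bandwidth_gb_s / model_gb

# A GPU streaming 60 GB of weights from HBM at an assumed 3,000 GB/s:
gpu_estimate = tokens_per_sec(3_000, 60)        # ~50 tok/s ceiling

# On-wafer SRAM at an assumed 1,000,000 GB/s aggregate bandwidth,
# feeding a distilled 25 GB model that fits entirely on chip:
sram_estimate = tokens_per_sec(1_000_000, 25)   # ~40,000 tok/s ceiling
```

The exact numbers don't matter; the ratio does. Orders-of-magnitude more bandwidth buys orders-of-magnitude more headroom, provided the model is small enough to live in SRAM.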
OpenAI frames Codex-Spark as a "sub-agent." The bigger GPT-5.3-Codex handles the thinking -- planning, architecture, complex reasoning. Spark handles the typing -- boilerplate, fixes, syntax corrections. Picture a senior developer sketching out the approach while a fast-fingered junior developer writes the code.
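In practice the division of labor might look like a simple router. The model names come from the article, but the `route()` heuristic and the task taxonomy are assumptions for illustration; OpenAI has not documented a routing API like this:

```python
# Hypothetical sketch of the planner/executor split described above.
# The task categories and routing rule are illustrative assumptions;
# the actual orchestration between the two models is not public.

LIGHTWEIGHT_TASKS = {"rename", "regex", "syntax_fix", "boilerplate"}

def route(task_kind: str) -> str:
    """Send shallow, mechanical edits to the fast sub-agent; everything
    else goes to the full model with deeper reasoning."""
    if task_kind in LIGHTWEIGHT_TASKS:
        return "codex-spark"    # fast distilled model on wafer-scale SRAM
    return "gpt-5.3-codex"      # slower, full reasoning depth
```

The design win is that most edits in a coding session are mechanical, so the fast path handles the bulk of the traffic while the expensive model is reserved for the few calls that need it.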
Persistent WebSocket connections shave 80% off per-request connection overhead. And despite being smaller, Spark beats GPT-5.1-Codex-mini on SWE-Bench Pro and Terminal-Bench 2.0. The design logic is clean: match the model to the task, match the hardware to the model.
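The connection-reuse point is easy to model: a fresh HTTPS call pays TCP and TLS handshake round trips every time, while a long-lived WebSocket pays them once. The handshake latency and request count below are assumed values for illustration, not measurements from the Codex stack:

```python
# Rough model of connection-setup overhead with and without reuse.
# handshake_ms bundles TCP + TLS setup; the values are assumptions.

def total_overhead_ms(n_requests: int, handshake_ms: float,
                      persistent: bool) -> float:
    """Total connection-setup cost for n requests.
    A persistent socket handshakes once; per-request HTTP handshakes
    once per call."""
    return handshake_ms if persistent else n_requests * handshake_ms

per_request = total_overhead_ms(100, 50, persistent=False)  # 5000 ms
reused      = total_overhead_ms(100, 50, persistent=True)   # 50 ms
```

For the rapid-fire, small-payload edits Spark targets, setup cost dominates per call, which is exactly where amortizing the handshake pays off most.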
What to Watch
- More chip partners on deck. OpenAI called this a "first milestone." With AMD, Broadcom, and Google TPU relationships already running, expect hardware-specific model variants for other platforms before the year is out.
- The sub-agent pattern going mainstream. Codex-Spark proves that smaller, task-tuned models on specialized silicon can outrun general-purpose models for narrow jobs. Other labs will follow.
- Nvidia pricing pressure. Every workload OpenAI moves off Nvidia GPUs chips away at Jensen Huang's leverage. How fast other AI labs copy this diversification playbook is the real question.
References
- OpenAI -- Introducing GPT-5.3-Codex-Spark
- Cerebras Blog -- OpenAI Codex-Spark
- TechCrunch -- A new version of OpenAI's Codex is powered by a new dedicated chip
- Tom's Hardware -- OpenAI launches GPT-5.3-Codex-Spark on Cerebras chips
- ServeTheHome -- OpenAI GPT-5.3-Codex-Spark running at 1K tokens/sec
- Verdantix -- Inside OpenAI's $1 Trillion Compute Bet
- Wikipedia -- OpenAI
- Gadgets360 -- OpenAI GPT-5.3-Codex-Spark details