OpenAI Teaches AI to Think
Q3 2024 introduced a new paradigm: AI that reasons. OpenAI's o1, Meta's record-breaking 405B open model, and Flux's image generation breakthrough defined the quarter.
From The Bit Baker Quarterly Roundup — Q3 2024
PLUS: Meta's 405B open-source giant, Flux takes over image gen, and AI video goes mainstream
Good morning, Dave. For two years, every AI advance boiled down to the same formula — more data, more parameters, more compute. Then in September, OpenAI did something different. They released o1, a model that pauses, thinks through its reasoning, catches its own mistakes, and then answers. It's slower. It costs more. And it crushed benchmarks that had stumped every previous model.
That shift — from pattern matching to deliberate reasoning — may end up being the defining moment of 2024. But Q3 had plenty more going on. Meta proved open source can genuinely compete at the frontier. A team of ex-Stability AI engineers dropped an image model that dethroned Midjourney overnight. And AI video crossed the line from curiosity to capability.
In this quarter's Bit Baker:
- OpenAI's o1 introduces chain-of-thought reasoning to production AI
- Meta's Llama 3.1 405B challenges closed models on their own turf
- Black Forest Labs' Flux disrupts AI image generation
- AI video goes mainstream with Runway Gen-3 Alpha
OpenAI's o1 Thinks Before It Speaks
The Bit Baker: On September 12, OpenAI released o1-preview — codenamed "Strawberry" — a model that uses internal chain-of-thought reasoning to work through problems step by step before generating an answer, and the results on hard benchmarks were staggering.
Unpacked:
- On AIME-level competition math, o1 jumped from GPT-4o's roughly 13% accuracy to scores that OpenAI said would place it among the top US students. On the GPQA science benchmark, OpenAI reported it surpassing PhD-level human experts on physics, chemistry, and biology questions that require multi-step logical deduction.
- The key innovation isn't a bigger model — it's reinforcement learning applied to reasoning itself. The model generates a hidden chain of thought, spots errors in its own logic, tries alternative approaches, and only then produces a final answer. Think of it as "thinking time" traded for accuracy.
- The tradeoff is speed and cost. o1 is noticeably slower and more expensive than GPT-4o, and the preview shipped without web access or image understanding — making it a specialist tool, not a general replacement.
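OpenAI hasn't published o1's training recipe, but the core idea of trading extra "thinking" steps for accuracy can be illustrated with a toy generate-and-verify loop. Everything below is an illustrative sketch, not OpenAI's actual method: the `verify` function stands in for the model critiquing its own reasoning.

```python
import itertools

def verify(candidate, target):
    """Toy stand-in for self-checking: accept a candidate only if it
    satisfies both constraints (sum equals target, product equals 12)."""
    a, b = candidate
    return a + b == target and a * b == 12

def solve_with_thinking(target, max_attempts=100):
    """Generate-and-verify loop: propose candidates one by one and
    return the first that passes the check, counting how many
    'thinking' steps were spent to get there."""
    attempts = 0
    for a, b in itertools.product(range(1, target), repeat=2):
        attempts += 1
        if attempts > max_attempts:
            break
        if verify((a, b), target):
            return (a, b), attempts
    return None, attempts

# Find a, b with a + b = 7 and a * b = 12; extra attempts are the
# cost paid for a verified answer.
answer, steps = solve_with_thinking(7)
```

The tradeoff in the bullet above falls out directly: the loop is slower than guessing once, but every answer it returns has passed the check.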
Bottom line: o1 cracked open a new dimension in AI capability. Instead of just scaling models bigger, OpenAI showed that teaching models how to think — not just what to say — can unlock performance jumps that raw scale couldn't deliver. Every other lab is now racing to build their own reasoning layer.
Meta Ships a 405B-Parameter Open-Source Model
The Bit Baker: Meta released Llama 3.1 on July 23 in three sizes — 8B, 70B, and a massive 405B — making the 405B the largest openly available language model and one that rivals GPT-4o and Claude 3.5 Sonnet on major benchmarks.
Unpacked:
- The 405B model, trained on over 16,000 H100 GPUs, matches or beats closed competitors on reasoning, math, coding, and multilingual tasks across eight languages. It's the first open model to credibly claim frontier-level performance.
- Context length jumped from 8K tokens in Llama 3 to 128K tokens — a 16x increase that makes it practical for analyzing long documents, entire codebases, and multi-turn conversations without losing context.
- Meta released weights under a permissive license, explicitly encouraging fine-tuning, distillation, and derivative models — a move that lets any company or researcher build custom AI without depending on API providers.
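Open weights don't mean easy self-hosting. A back-of-envelope calculation (assuming the standard 2 bytes per parameter for bfloat16 weights, and ignoring KV cache, activations, and optimizer state) shows why the 405B model still demands a multi-GPU node:

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Rough memory needed just to hold the model weights —
    excludes KV cache, activations, and any optimizer state."""
    return n_params * bytes_per_param / 1e9

# Llama 3.1 sizes, in parameters
sizes = {"8B": 8e9, "70B": 70e9, "405B": 405e9}

for name, n in sizes.items():
    bf16 = weight_memory_gb(n, 2)    # bfloat16: 2 bytes per parameter
    int4 = weight_memory_gb(n, 0.5)  # 4-bit quantized: 0.5 bytes
    print(f"{name}: ~{bf16:.0f} GB in bf16, ~{int4:.0f} GB at 4-bit")
```

At 2 bytes per parameter, the 405B weights alone occupy roughly 810 GB — far beyond a single H100's 80 GB — which is why distillation and quantization matter so much for the derivative models Meta's license encourages.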
Bottom line: Meta's bet is that commoditizing the model layer hurts competitors who charge for API access more than it hurts Meta. Llama 3.1 405B made "open source can't compete with closed" a harder argument to make. And every startup that can't afford GPT-4o's pricing just got a viable alternative.
Flux Arrives and Dethrones Midjourney Overnight
The Bit Baker: Black Forest Labs — founded by ex-Stability AI researchers — launched the FLUX.1 suite on August 1, and within weeks, the open-weight model had displaced Midjourney as the go-to for AI image generation in developer and creative communities.
Unpacked:
- FLUX.1 ships in three tiers: Pro (top quality via API), Dev (open weights for non-commercial use), and Schnell (Apache 2.0 licensed for local use) — giving everyone from hobbyists to enterprises a way in.
- The 12-billion-parameter model uses a hybrid diffusion transformer architecture with flow matching, producing images with noticeably better text rendering, prompt adherence, and anatomical accuracy than Midjourney v6, DALL-E 3, or Stable Diffusion 3.
- The open-weight release on Hugging Face meant the community immediately started building custom workflows in ComfyUI and integrating Flux into existing pipelines — an adoption speed that proprietary models can't match.
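Black Forest Labs hasn't published FLUX.1's full training recipe, but the flow-matching objective it cites is well documented: interpolate on a straight line between a noise sample and a data sample, and train the network to predict the constant velocity along that line. A minimal NumPy sketch of the target construction (toy shapes, no actual network):

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_targets(x0, x1, t):
    """Conditional flow matching (rectified-flow form): the training
    input x_t lies on the straight line from noise x0 to data x1 at
    time t, and the regression target is the velocity (x1 - x0)."""
    x_t = (1 - t) * x0 + t * x1   # point on the noise -> data path
    v_target = x1 - x0            # velocity the network learns to predict
    return x_t, v_target

# toy batch: 4 "images" of 8 pixels each
x1 = rng.standard_normal((4, 8))   # data samples
x0 = rng.standard_normal((4, 8))   # pure noise
t = rng.uniform(size=(4, 1))       # one timestep per sample

x_t, v = flow_matching_targets(x0, x1, t)
# The model's loss would be mean squared error between its
# prediction v_theta(x_t, t) and v.
```

The appeal of this formulation is that the straight-line paths make sampling fast: fewer, larger integration steps stay accurate, which is what makes a distilled variant like Schnell practical to run locally.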
Bottom line: Flux proved that Stability AI's brain drain was someone else's gain. The model didn't just compete — it set a new standard for open image generation. And by launching with permissive licensing from day one, Black Forest Labs avoided the community trust problems that plagued Stability AI's final months.
AI Video Crosses the Capability Threshold
The Bit Baker: Runway launched Gen-3 Alpha in late June, and over Q3 a wave of competing video models from Kling, Luma, and others pushed AI video from "impressive demo" to "actually usable" for real production work.
Unpacked:
- Gen-3 Alpha produces 10-second clips with coherent motion, consistent lighting, and controllable camera angles — features like Motion Brush and Director Mode let creators steer the output rather than just prompting and hoping.
- The competitive landscape exploded. China's Kling model matched Gen-3's quality, Luma's Dream Machine pushed fast iteration, and xAI integrated Flux into Grok-2 for image generation — every major player wanted a seat at the table.
- Real production studios started using these tools. Runway signed a partnership with Lionsgate for film production, marking the first major Hollywood studio to formally adopt AI video generation as part of its workflow.
Bottom line: Q3 was when AI video stopped being a novelty and started being infrastructure. The tools aren't replacing cinematographers yet, but they're already changing how concept art, storyboarding, and pre-visualization work. The question shifted from "can AI make video?" to "who controls the pipeline?"
The Shortlist
OpenAI launched Structured Outputs in the API, guaranteeing that model responses follow a specified JSON schema — a seemingly small feature that unlocked reliable integration for thousands of production applications.
xAI released Grok-2 and Grok-2 mini in August, with frontier-level chat and coding capabilities plus built-in image generation via Flux, available to X Premium subscribers.
Anthropic quietly upgraded Claude 3.5 Sonnet in August with improved coding performance and launched the Artifacts feature more broadly, turning Claude into a real-time collaborative workspace.
Apple began rolling out Apple Intelligence betas to developers over the summer, though the full consumer launch wouldn't land until iOS 18.1 in October — a slower timeline than competitors expected.
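The Structured Outputs item above is easy to underrate, so here is what the guarantee means in practice. The sketch below builds the request shape as announced in August 2024 (field names may lag the current API — treat it as illustrative, not live code) and uses a minimal stdlib check, `conforms`, a hypothetical helper written for this example, to show what "the response always matches the schema" buys an integrator:

```python
import json

# JSON schema the model's output must follow (strict mode requires
# additionalProperties: false and every property listed as required).
event_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "date": {"type": "string"},
        "attendees": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["name", "date", "attendees"],
    "additionalProperties": False,
}

# Request body shape from the Structured Outputs launch.
request_body = {
    "model": "gpt-4o-2024-08-06",
    "messages": [{"role": "user", "content": "Extract the event details."}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {"name": "event", "strict": True, "schema": event_schema},
    },
}

def conforms(payload: str, schema: dict) -> bool:
    """Minimal check of the guarantee: valid JSON, every required key
    present, no extra keys (a full validator would also check types)."""
    data = json.loads(payload)
    required_ok = all(k in data for k in schema["required"])
    no_extras = set(data) <= set(schema["properties"])
    return required_ok and no_extras

sample = '{"name": "Launch", "date": "2024-09-12", "attendees": ["Ada"]}'
```

Before this feature, production apps wrapped every model call in retry-and-repair logic for malformed JSON; with schema-constrained decoding, the `conforms` check is something the API promises rather than something the caller must enforce.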