Chain of Thought

Vibe Check: Codex—OpenAI's New Coding Agent

Our hands-on day-0 review of the new autonomous software engineer

🎧 Bonus: A special episode of AI & I with OpenAI product team member Alexander Embiricos is now live on X.

Last night I shipped a new feature for Cora, Every's AI-enabled email assistant. Cora is not a vibe-coded product: Its codebase is a 5,5000-plus commit cathedral of Rails craftsmanship mostly from Kieran Klaassen, Cora's general manager, our resident DHH. Needless to say, exactly zero previous commits are mine.

Undaunted (or somewhat daunted, but holding my shit together), I pressed "merge" on a pull request for a little quality-of-life UI update, and what had previously been just a twinkle in my eye appeared on production in less than an hour. How?

I used Codex—OpenAI's new coding agent, launching publicly as a research preview today. Like Devin, Codex is designed as a software engineer that can build features and fix bugs autonomously. OpenAI has tried to incorporate the taste of a senior software engineer into how Codex writes code: It's familiar with how large codebases work, and writes clean code and commit messages. It's designed for you to run many sessions in parallel, so you can manage a team of agents without touching a single line of code.

In OpenAI's storied tradition, Codex is confusingly named. The company has previously used the same name for both a model and a command-line tool. (Here's an o3 summary of the full history of OpenAI's use of this name.) It's a little rough around the edges and balkanized into a product separate from ChatGPT (more on this later). Even so, it's useful.

We've been testing it for the last few days at Every. What follows is our day-zero vibe check. I invited Kieran to help me write this review because Codex, unlike many other AI coding agents, is clearly built for senior engineers like him, and I think his perspective is important.

We'll go through what it is, how it works, what to use it for, and what it means. But first: the Reach Test.

The Reach Test: Do we use it every day?

The best leading indicator for long-term usefulness of an AI product is what I'm calling the Reach Test: Do I find myself automatically turning to this tool to do certain tasks? Or do I just leave it on the shelf and forget it's there?

Here are the results of our Reach Test on Codex:

Kieran (agent-pilled tech lead archetype): Yes, he's thinking about how to use it all day and night.
Dan (technical tinkerer CEO, weekend vibe coder archetype): No, but because I'm normally coding on net new ideas rather than existing products.

Overall: It's a tool you'll reach for if you're a tech lead adding features or fixing bugs on an existing codebase. If you're trying to vibe code a new one-person billion-dollar SaaS company, look elsewhere.

A Codex tour

Codex presents you with a simple text box that asks you to describe the programming task you want it to perform, followed by two buttons: "Ask" and "Code":

Source: Codex/Author's screenshot.

This is a telling user interface. It's not a chat box, like ChatGPT, where you can casually go back and forth. Instead, it's a text box for you to type in your well-specified coding task for it to complete.

When you press "Code," this screen doesn't disappear. Most other agent products open up a new window for you to watch the agent do its work. Instead, Codex slides your piping-hot new job into its task queue and blinks status updates as it gets going.

If you click into the task as it's running, you can see a detailed log of what it's thinking as it's working:

Source: Codex/Author's screenshot.

For each new task, it spins up your codebase in its own sandboxed environment in which it can run tests and linters—aka spell check for programmers—so it can catch errors itself. Unlike Devin, it does not yet have access to a browser where it can use the code it's written to see if it does what it's supposed to do.
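To make the "run the checks yourself before handing work back" idea concrete, here is a minimal sketch of that loop. It is not how Codex is implemented: a temporary directory stands in for a real sandbox, and `run_checks` and the check names are hypothetical, chosen only for illustration.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


def run_checks(repo_dir, commands):
    """Copy repo_dir into a throwaway directory and run each check there.

    `commands` maps a check name (e.g. "tests", "lint") to an argv list.
    Returns a dict mapping each check name to True (exit code 0) or False.
    Assumption: a temp directory is only a stand-in for a real sandbox,
    which would isolate far more than the file system.
    """
    results = {}
    with tempfile.TemporaryDirectory() as scratch:
        workdir = Path(scratch) / "repo"
        shutil.copytree(repo_dir, workdir)  # the original checkout is never touched
        for name, argv in commands.items():
            proc = subprocess.run(argv, cwd=workdir, capture_output=True, text=True)
            results[name] = proc.returncode == 0
    return results


if __name__ == "__main__":
    # Stand-in commands: one check that passes and one that fails.
    passing = [sys.executable, "-c", "raise SystemExit(0)"]
    failing = [sys.executable, "-c", "raise SystemExit(1)"]
    print(run_checks(".", {"tests": passing, "lint": failing}))
```

In a real project the two commands would be whatever the codebase already uses for its test suite and linter; an agent can read the failures and revise its change before showing you anything.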

When it's finished, it'll give you a concise summary of what it did along with a code diff so you can see exactly what changes it made. It also has a button for you to easily submit its changes as a pull request on GitHub: