OpenAI Launches GPT-5.4 With 1 Million Token Context Window and Human-Surpassing Benchmark Scores
OpenAI's latest frontier model offers a 1M-token context window, a dedicated reasoning mode, and benchmark scores that surpass human performance on real-world desktop tasks for the first time.
GPT-5.4: OpenAI's Most Capable Model Yet
OpenAI has released GPT-5.4, its latest frontier AI model, billing it as "our most capable and efficient frontier model for professional work." The release comes in three variants: the standard GPT-5.4, a reasoning-optimized GPT-5.4 Thinking, and a high-performance GPT-5.4 Pro tier.
The headline number is the context window: GPT-5.4 supports up to 1 million tokens through the API, doubling GPT-5.2's capacity and matching what Google and Anthropic offer. That's enough to process entire codebases, lengthy legal documents, or massive research datasets in a single session.
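As a rough illustration of what a 1-million-token budget means in practice, the sketch below estimates whether a set of text files would fit in a single request. It is a back-of-the-envelope check, not an official tool. The four-characters-per-token ratio, the output reserve, and all function names are assumptions made for this example; a real tokenizer will give different counts.

```python
from pathlib import Path

CONTEXT_WINDOW_TOKENS = 1_000_000  # context size reported for GPT-5.4 via the API
CHARS_PER_TOKEN = 4  # assumed rough heuristic, not a real tokenizer


def estimate_tokens(text: str) -> int:
    """Approximate the token count from the character count (rounded up)."""
    return -(-len(text) // CHARS_PER_TOKEN)


def estimate_tree_tokens(root: str, suffixes=(".py", ".md", ".txt")) -> int:
    """Sum the estimated tokens of all matching files under a directory."""
    total = 0
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="ignore")
            total += estimate_tokens(text)
    return total


def fits_in_context(token_count: int, reserve_for_output: int = 50_000) -> bool:
    """Check the estimate against the window, leaving assumed room for a reply."""
    return token_count + reserve_for_output <= CONTEXT_WINDOW_TOKENS
```

Calling `fits_in_context(estimate_tree_tokens("path/to/project"))` gives a quick yes or no for a project directory; the path here is a placeholder.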
Surpassing Humans at Computer Tasks
Perhaps the most striking claim is GPT-5.4's performance on the OSWorld-Verified benchmark, which evaluates how effectively an AI system can navigate desktop environments — managing files, editing documents, interacting with applications, and executing multi-step workflows. GPT-5.4 scored 75%, edging past the human baseline of 72.4%.
That is a significant leap from GPT-5.2, which scored roughly half as much. The model also posted record scores on WebArena Verified (web navigation tasks) and scored 83% on OpenAI's GDPval test for knowledge-work tasks across dozens of professions.
Brendan Foody, CEO of benchmarking firm Mercor, noted the model's strength in professional workflows:
"GPT-5.4 excels at creating long-horizon deliverables such as slide decks, financial models, and legal analysis, delivering top performance while running faster and at a lower cost than competitive frontier models."
Smarter, Not Just Bigger
OpenAI emphasized that GPT-5.4 isn't just more powerful; it's also more efficient. The company claims the model solves the same problems with significantly fewer tokens than its predecessor, meaning lower costs for developers. Hallucinations are down too: compared to GPT-5.2, GPT-5.4 is 33% less likely to make errors in individual claims, and its responses are 18% less likely to contain factual errors overall.
Tool Search: A New Approach to API Calls
One of the more technically interesting additions is Tool Search, a new system for managing function calls in the API. Previously, developers had to include definitions for all available tools in the system prompt — a costly approach as tool libraries grew. Tool Search lets the model look up tool definitions on demand, resulting in faster and cheaper API requests for applications with many available tools.
This positions GPT-5.4 less as a chatbot and more as an agentic system — one designed to operate across multiple software environments, plan multi-step workflows, and execute tasks autonomously over extended periods.
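The difference between sending every tool definition up front and looking definitions up on demand can be sketched in a few lines. The example below is a toy illustration of the general idea only and does not use or reproduce OpenAI's API. The tool library, the keyword-overlap scoring, and the function names are all hypothetical.

```python
import json

# Hypothetical tool library; names and schemas are invented for illustration.
TOOL_LIBRARY = {
    "get_weather": {
        "description": "Look up the current weather for a city",
        "parameters": {"city": "string"},
    },
    "send_email": {
        "description": "Send an email message to a recipient",
        "parameters": {"to": "string", "body": "string"},
    },
    "create_invoice": {
        "description": "Create an invoice for a customer order",
        "parameters": {"order_id": "string"},
    },
    "search_flights": {
        "description": "Find flights between two airports on a date",
        "parameters": {"origin": "string", "destination": "string"},
    },
}


def payload_size(tool_defs: dict) -> int:
    """Size in characters of the serialized definitions sent with a request."""
    return len(json.dumps(tool_defs))


def search_tools(query: str, library: dict, limit: int = 2) -> dict:
    """Return up to `limit` tools whose name or description shares words with the query."""
    words = set(query.lower().split())
    scored = []
    for name, spec in library.items():
        text = (name.replace("_", " ") + " " + spec["description"]).lower()
        score = len(words & set(text.split()))
        if score:
            scored.append((score, name))
    # Highest overlap first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return {name: library[name] for _, name in scored[:limit]}


# Only the matching definitions are attached, so the payload shrinks.
selected = search_tools("weather in the city", TOOL_LIBRARY)
print(list(selected), payload_size(selected), "vs", payload_size(TOOL_LIBRARY))
```

With four tools the saving is small, but the gap between the two payload sizes grows with the size of the library, which is the cost the article describes.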
Safety: Can You Trust the Chain of Thought?
OpenAI also addressed a growing concern in AI safety: whether reasoning models can deceive users by misrepresenting their chain-of-thought. Research from Anthropic has shown this can happen under certain circumstances.
OpenAI released a new safety evaluation showing that deception is less likely in GPT-5.4 Thinking, "suggesting that the model lacks the ability to hide its reasoning and that CoT monitoring remains an effective safety tool." Whether that holds up under adversarial conditions remains to be seen.
The Bigger Picture
GPT-5.4 arrives at a moment when frontier models are rapidly shifting from conversational assistants to autonomous digital workers. With its million-token context window, improved reasoning, and a desktop-task benchmark score that now edges past the human baseline, OpenAI is making its clearest bet yet on the age of AI agents.
The model is available now to ChatGPT Plus and Pro subscribers, with API access rolling out to developers. Pricing details for the 1M-token tier have not yet been disclosed.
The AI race continues to accelerate. The question is no longer whether these models can do complex work — it's how quickly businesses and institutions will be willing to let them.