
Inside GPT-OSS: Open-Weight Reasoning Models Built for Agentic AI

GPT-OSS-20B and GPT-OSS-120B


GPT-OSS is a pair of open-weight reasoning models released under the Apache 2.0 license: gpt-oss-20b and gpt-oss-120b. Unlike the proprietary ChatGPT, GPT-OSS is designed to be run, modified, and fine-tuned by developers. The models are optimized for agentic workflows, long-form reasoning, tool use, and structured outputs.

This article provides a practical overview and insight into GPT-OSS, its architecture, training pipeline, reasoning abilities, and how it fits in today’s context.

GPT-OSS is much more flexible than ChatGPT

  • Open-weight and flexible, allowing for tweaking and experimentation.

  • Designed for agentic workflows (e.g., Python execution and web search)

  • Capable of long CoT (chain of thought)

  • Customizable by developers

As of today, OpenAI operates ChatGPT, one of the most widely used and influential public conversational AI services, so the company releasing open-weight models to the public draws a lot of attention in both academia and industry. However, GPT-OSS has a very different safety profile from OpenAI's flagship product and API endpoints. This difference is not due to weaker training, but because open-weight models cannot rely on centralized, system-level safeguards once they are released.

The GPT-OSS-20B model is meant to run on regular consumer hardware with a minimum of 16 GB of main memory, whereas GPT-OSS-120B is meant for servers. While one can use ChatGPT out of the box, GPT-OSS requires some basic coding to get running; I personally find that a small Python script does the job quickly and efficiently. Otherwise, to get a feel for how the model responds, Hugging Face provides an inference option on the model landing page, though this is a freemium option and may be discontinued in the future.
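As a rough sketch, such a script could look like the following. This assumes the `transformers` library is installed and enough memory is available; the repository name "openai/gpt-oss-20b" is the model's Hugging Face ID, and the chat-style pipeline output format shown is that of recent `transformers` versions.

```python
# Minimal sketch of local inference with gpt-oss-20b via the transformers
# text-generation pipeline. The heavy model download happens only inside
# run_local(), so the helper below can be tested independently.

def build_messages(prompt: str) -> list[dict]:
    """Wrap a user prompt in the chat-message format the pipeline expects."""
    return [{"role": "user", "content": prompt}]

def run_local(prompt: str) -> str:
    # Heavy step: downloads the weights on first call.
    from transformers import pipeline
    generator = pipeline("text-generation", model="openai/gpt-oss-20b")
    out = generator(build_messages(prompt), max_new_tokens=256)
    # Recent transformers versions return the full chat transcript;
    # the last message is the assistant's reply.
    return out[0]["generated_text"][-1]["content"]

# Usage (heavy; downloads ~16 GB of weights on first call):
# print(run_local("Explain the Apache 2.0 license in one sentence."))
```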

Architecture Overview

GPT-OSS models are autoregressive Mixture-of-Experts (MoE) transformers built on architectural ideas from GPT-2 and GPT-3.

Model Sizes

  • GPT-OSS-120B: 116.8B total parameters, 36 layers, 5.1B active parameters per token.

  • GPT-OSS-20B: 20.9B total parameters, 24 layers, 3.6B active parameters per token.

To enable efficient inference, only a subset of experts is active per token, even though the total parameter count is large.
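The routing idea can be sketched in a few lines. This is a toy illustration of top-k expert selection, not the actual GPT-OSS router; the expert count and scores are invented for the example.

```python
# Toy Mixture-of-Experts routing: a router scores every expert, but only the
# top-k experts are actually evaluated for a given token.
import math

def top_k_route(router_logits: list[float], k: int) -> list[tuple[int, float]]:
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(enumerate(router_logits), key=lambda p: p[1], reverse=True)[:k]
    z = sum(math.exp(score) for _, score in top)
    return [(idx, math.exp(score) / z) for idx, score in top]

# Example: 8 experts, only 2 active for this token.
active = top_k_route([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3], k=2)
print(active)  # two (expert_index, weight) pairs; the weights sum to 1
```

Only the selected experts' feed-forward blocks run, which is why the "active parameters per token" figure is so much smaller than the total parameter count.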

Quantization and Hardware Efficiency

GPT-OSS applies post-training quantization to the MoE weights using the MXFP4 format (4.25 bits per parameter). This reduces the memory footprint so that users do not need a large server farm to run inference. It enables:

  • GPT-OSS-120B to run on a single 80 GB GPU

  • GPT-OSS-20B to run with only 16 GB of memory

Even with modest computing resources, users can run inference locally. Security-sensitive documents such as payslips, banking records, and medical information can be summarized by the local model without any data leaking over the internet. Without quantization, individual researchers, small companies, and even regular users would not be able to run capable, conversational models locally.
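A back-of-the-envelope calculation shows why the 4.25-bits-per-parameter figure matters. These are lower bounds on real memory use, since activations and the KV cache add overhead on top of the weights.

```python
# Rough weight-memory estimate: parameters * bits-per-parameter / 8 bytes.
# Parameter counts are the totals quoted above (116.8B and 20.9B).

def weight_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (using 1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

for name, params in [("gpt-oss-120b", 116.8), ("gpt-oss-20b", 20.9)]:
    fp16 = weight_gb(params, 16)      # unquantized half precision
    mxfp4 = weight_gb(params, 4.25)   # MXFP4 post-training quantization
    print(f"{name}: ~{fp16:.0f} GB at FP16 vs ~{mxfp4:.0f} GB at MXFP4")
```

At MXFP4, the 120B model's weights come to roughly 62 GB (fitting an 80 GB GPU) and the 20B model's to roughly 11 GB (fitting 16 GB of memory), whereas at FP16 neither would fit.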

Tokenizer

tiktoken is an open-source library on GitHub, maintained by OpenAI. It provides the o200k_harmony tokenizer, a Byte Pair Encoding (BPE) tokenizer extended with special tokens for role-based and channel-based chat formatting. BPE is a method that breaks text into small, reusable pieces so a language model can efficiently represent and generate words, even ones it has never seen before.
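To make the BPE idea concrete, here is a toy version of a single merge step; real tokenizers like o200k_harmony apply thousands of such learned merges, but the mechanism is the same.

```python
# One Byte Pair Encoding (BPE) merge step: find the most frequent adjacent
# pair of symbols and fuse it into a new, longer token.
from collections import Counter

def most_frequent_pair(tokens: list[str]) -> tuple[str, str]:
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge(tokens: list[str], pair: tuple[str, str]) -> list[str]:
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = list("low lower lowest")
pair = most_frequent_pair(tokens)  # ('l', 'o') appears three times
print(merge(tokens, pair))         # each 'l','o' run is now a single 'lo' token
```

Repeating this step on a large corpus builds the tokenizer's vocabulary, which is why even unseen words can be represented as combinations of known pieces.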

Pretraining

Data

GPT-OSS models are trained on a text-only dataset containing trillions of tokens with special emphasis on:

  • STEM and mathematics

  • Programming and code

  • General knowledge

Harmful content, such as hazardous biosecurity material, was filtered out using CBRN pretraining filters originally developed for GPT-4o.

Training Infrastructure

Training was conducted on NVIDIA H100 GPUs using PyTorch with expert-optimized Triton kernels and FlashAttention:

  • GPT-OSS-120B: approximately 2.1M H100 GPU-hours

  • GPT-OSS-20B: approximately 10x fewer GPU-hours

Post-Training and Reasoning

After pretraining, the models were post-trained using chain-of-thought (CoT) reinforcement learning techniques, in a similar way to how OpenAI’s o3 models were trained. The models are taught:

  • how to reason step by step

  • how to solve complex math and coding problems

  • how to use tools such as Python execution and web browsing

Harmony Chat Format

GPT-OSS uses the harmony chat format, a role-based messaging structure with explicit message boundaries for roles like System, Developer, User, and Assistant. Imagine asking a friend to help you prepare for final exams: the friend acts only as the questioner, while you act only as the answerer. GPT-OSS is trained with a similar convention, following the User-Assistant setting to chat with a human user.

The format also introduces channels:

  • analysis: internal chain-of-thought

  • commentary: tool calls

  • final: user-visible output

This structure enables advanced agentic behaviors but requires careful handling, especially in multi-turn conversations. One has to keep in mind that the user and assistant roles must not be reversed; otherwise the application would not serve its purpose.

Variable Reasoning Effort

In ChatGPT, one can choose among options like Auto, Instant, Thinking, Pro, and Deep-Research, which tell the system how long the user is willing to wait and how much chain-of-thought to spend on a response. Similarly, GPT-OSS offers three effort levels:

  • low

  • medium

  • high

These are set via system prompt keywords (e.g., Reasoning: high). Higher levels result in longer chain-of-thought and improved accuracy, at the cost of latency and compute.
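Setting the effort level is then just a matter of composing the system prompt. The surrounding assistant text and validation below are my own additions for illustration; only the `Reasoning: <level>` keyword comes from the model card.

```python
# Sketch of selecting the reasoning effort via the system prompt.

def reasoning_system_prompt(effort: str) -> str:
    """Build a system prompt carrying the reasoning-effort keyword."""
    if effort not in ("low", "medium", "high"):
        raise ValueError(f"unknown effort level: {effort}")
    return f"You are a helpful assistant.\nReasoning: {effort}"

print(reasoning_system_prompt("high"))
```

A higher level buys accuracy with longer chains of thought, so in latency-sensitive applications it makes sense to default to low or medium and escalate only for hard queries.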

Agentic Tool Use

A large language model with fewer parameters is more prone to hallucinations, so it is to the users’ benefit that the model can interact with external tools. GPT-OSS models are trained to interact with:

  • Web browsing via search and open

  • Python code execution in a stateful Jupyter environment

  • Arbitrary developer-defined functions with schemas

Tool use can be enabled or disabled through the system prompt.
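As an example of a developer-defined function with a schema, here is a hypothetical tool description in the JSON-Schema style common to function-calling APIs. The `get_weather` name and its fields are invented for this sketch; the exact harmony wire format differs, but the ingredients (name, description, typed parameters) are the same.

```python
# Hypothetical developer-defined tool schema, plus the kind of call the
# model would emit on the commentary channel when it decides to use it.

get_weather_tool = {
    "name": "get_weather",  # invented function name for illustration
    "description": "Return the current temperature for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. 'Berlin'"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}

# A tool call produced by the model might then look like:
call = {"tool": "get_weather", "arguments": {"city": "Berlin", "unit": "celsius"}}
print(call["tool"], call["arguments"]["city"])
```

The application is responsible for executing the call and feeding the result back into the conversation, which is what makes the agentic loop work.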

Evaluation Results

Reasoning and Coding

While it is debatable which of Gemini, Claude, ChatGPT, and the other public chat models performs best on complex problems, GPT-OSS is, among locally run models, particularly strong in mathematics and reasoning-heavy tasks. For example, GPT-OSS-20B uses over 20k chain-of-thought tokens per AIME problem on average. Both models perform strongly on coding and tool-use benchmarks, with GPT-OSS-120B approaching the performance of OpenAI’s o4-mini.

Health Performance

GPT-OSS models perform competitively on HealthBench evaluations.

Notably, GPT-OSS-120B approaches OpenAI o3 performance and outperforms several frontier closed models. However, it is important to note that the advice from a large language model is not intended to replace medical professionals.

Multilingual Performance

On MMMLU benchmarks across 14 languages, GPT-OSS-120B at a high reasoning effort setting comes close to o4-mini-high performance.

Safety and Limitations

Even the best systems have limitations, and GPT-OSS is no exception. As a light, open-weight model, it does not carry the same safety guarantees as OpenAI’s proprietary counterparts, and it has other limitations as well.

Preparedness Framework

GPT-OSS-120B does not reach OpenAI’s “High capability” thresholds under the Preparedness Framework, even after adversarial fine-tuning, in any of the tracked categories:

  • Biological and chemical capability

  • Cyber capability

  • AI self-improvement

Disallowed Content and Jailbreaks

On standard disallowed content evaluations, both models perform on par with OpenAI o4-mini. GPT-OSS-20B slightly underperforms on illicit/violent categories but still exceeds GPT-4o. Robustness to jailbreaks using the StrongReject benchmark is comparable to o4-mini.

Instruction Hierarchy

Any public chat LLM must follow the instruction hierarchy to behave as a chatbot; otherwise it would fail at the bare minimum: chatting. GPT-OSS follows system and developer message priorities, but it generally underperforms o4-mini on instruction-hierarchy tests such as system prompt extraction and prompt injection hijacking.

Chain-of-Thought and Hallucinations

GPT-OSS chains of thought are intentionally unrestricted, thus:

  • CoT may contain hallucinations

  • CoT may include unsafe or unfiltered language

Researchers found that when models are explicitly discouraged from expressing certain thoughts, they can instead learn to conceal their reasoning while continuing to behave incorrectly. Developers are advised not to expose raw chain-of-thought to end users, as it may contain hallucinated or unfiltered language that is inappropriate for direct display.

Fairness and Bias

On the BBQ fairness benchmarks, GPT-OSS models perform at roughly the same level as OpenAI o4-mini.

Conclusion

GPT-OSS represents a significant step forward for open-weight reasoning models. It combines long CoT reasoning, agentic tool use, and strong math and coding performance in a form developers can fully control. Training an LLM from scratch remains technically infeasible for most individuals because of the massive compute requirement, unless one owns a supercomputer at home or spends a fortune on cloud infrastructure.

Technical freedom and openness shift the burden of responsibility. OpenAI’s model card and license clearly list GPT-OSS’s shortcomings and where it may fail. Like a car, which can serve as transportation or become a lethal machine in the hands of a criminal, an LLM has to be used responsibly, especially when restrictions are lifted. Safety, deployment, and monitoring are no longer handled by a centralized API but by the system builder. For teams building agentic systems, GPT-OSS is not a drop-in replacement for ChatGPT, but it is a powerful foundation.

References

[1] OpenAI et al., “gpt-oss-120b & gpt-oss-20b Model Card,” arXiv:2508.10925, Aug. 08, 2025. doi: 10.48550/arXiv.2508.10925.

Disclaimer
All images and tables are reproduced from [1]. Cover Image is obtained from HuggingFace. This article is an independent technical summary and interpretation of the OpenAI GPT-OSS model card. The author is not affiliated with OpenAI, and any opinions or interpretations are solely those of the author.
