<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Simply ML - Machine Learning Made Clear]]></title><description><![CDATA[Simply ML breaks down machine learning concepts, research ideas, and papers into clear, intuitive explanations for learners and practitioners.]]></description><link>https://blogs.sagnikdas.com</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1753941961461/2e67d869-9d94-4a3d-b8d7-73908453eae2.png</url><title>Simply ML - Machine Learning Made Clear</title><link>https://blogs.sagnikdas.com</link></image><generator>RSS for Node</generator><lastBuildDate>Fri, 15 May 2026 04:58:26 GMT</lastBuildDate><atom:link href="https://blogs.sagnikdas.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Inside GPT-OSS: Open-Weight Reasoning Models Built for Agentic AI]]></title><description><![CDATA[GPT-OSS refers to a pair of open-weight reasoning models, released under the Apache 2.0 license. Accessible here: gpt-oss-20b; gpt-oss-120b. Unlike proprietary ChatGPT, GPT-OSS is designed to be run, modified, and fine-tuned by developers. The models...]]></description><link>https://blogs.sagnikdas.com/inside-gpt-oss-open-weight-reasoning-models-built-for-agentic-ai</link><guid isPermaLink="true">https://blogs.sagnikdas.com/inside-gpt-oss-open-weight-reasoning-models-built-for-agentic-ai</guid><category><![CDATA[AI]]></category><category><![CDATA[gpt-oss]]></category><category><![CDATA[chatgpt]]></category><dc:creator><![CDATA[Sagnik Das]]></dc:creator><pubDate>Thu, 22 Jan 2026 07:00:18 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1769040051321/5683e731-2764-4689-b6ac-7a0cc538faa9.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>GPT-OSS refers to a pair of open-weight reasoning models, released under the Apache 2.0 license. Accessible here: <a target="_blank" href="https://huggingface.co/openai/gpt-oss-20b">gpt-oss-20b</a>; <a target="_blank" href="https://huggingface.co/openai/gpt-oss-120b">gpt-oss-120b</a>. Unlike proprietary ChatGPT, GPT-OSS is designed to be <em>run, modified, and fine-tuned</em> by developers. The models are optimized for agentic workflows, long-form reasoning, tool use, and structured outputs.</p>
<p>This article provides a practical overview of GPT-OSS: its architecture, training pipeline, reasoning abilities, and how it fits into today’s landscape.</p>
<h2 id="heading-gpt-oss-is-much-more-flexible-than-chatgpt">GPT-OSS is much more flexible than ChatGPT</h2>
<ul>
<li><p>Open-weight and flexible, allowing for tweaking and experimentation.</p>
</li>
<li><p>Designed for agentic workflows (Python execution and web search)</p>
</li>
<li><p>Capable of long CoT (chain of thought)</p>
</li>
<li><p>Customizable by developers</p>
</li>
</ul>
<p>As of today, OpenAI operates ChatGPT, one of the most widely used and influential public conversational AI models, so the company releasing open-weight models to the public has drawn a lot of attention in both academia and industry. However, it is clear that GPT-OSS has a very different safety profile than OpenAI’s flagship product and API endpoints. This difference is not due to weaker training, but because open-weight models cannot rely on centralized, system-level safeguards once they are released. The GPT-OSS-20B model is meant to run on regular consumer hardware with at least 16GB of main memory, whereas GPT-OSS-120B is meant for servers. While ChatGPT works out of the box, GPT-OSS requires some basic coding to get running; I personally find that a small Python script does the job quickly and efficiently, as sketched below. Otherwise, to get a feel for how the model responds, Hugging Face provides an inference option on the model landing page, though this is a freemium option and may be discontinued in the future.</p>
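<p>For reference, here is a minimal sketch of such a script using the Hugging Face <code>transformers</code> library. It assumes a recent <code>transformers</code> release with chat-template support and enough memory for the 20B model; the prompt is just a placeholder.</p>
<pre><code class="lang-python">from transformers import pipeline

# Minimal local-inference sketch for gpt-oss-20b (weights download on first run).
pipe = pipeline(
    "text-generation",
    model="openai/gpt-oss-20b",
    torch_dtype="auto",   # pick an appropriate dtype for the hardware
    device_map="auto",    # place layers on GPU/CPU automatically
)

messages = [
    {"role": "user", "content": "Explain mixture-of-experts in two sentences."},
]
result = pipe(messages, max_new_tokens=256)
print(result[0]["generated_text"][-1]["content"])  # the assistant's reply
</code></pre>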
<h2 id="heading-architecture-overview">Architecture Overview</h2>
<p>GPT-OSS models are autoregressive <strong>Mixture-of-Experts (MoE)</strong> transformers built on architectural ideas from GPT-2 and GPT-3.</p>
<h4 id="heading-model-sizes">Model Sizes</h4>
<ul>
<li><p><strong>GPT-OSS-120B:</strong> 116.8B total parameters, 36 layers, 5.1B active parameters per token.</p>
</li>
<li><p><strong>GPT-OSS-20B:</strong> 20.9B total parameters, 24 layers, 3.6B active parameters per token.</p>
</li>
</ul>
<p>To enable efficient inference, only a subset of experts is active per token, even though the total parameter count is large.</p>
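<p>To make the routing idea concrete, here is a toy PyTorch sketch of top-k expert selection. The layer sizes, expert count, and choice of k are illustrative only, not the actual GPT-OSS configuration.</p>
<pre><code class="lang-python">import torch
import torch.nn as nn

class ToyMoE(nn.Module):
    """Toy mixture-of-experts layer: each token is processed by only k experts."""

    def __init__(self, dim=64, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(dim, num_experts)  # scores every expert per token
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])

    def forward(self, x):  # x: (tokens, dim)
        weights, idx = torch.topk(self.router(x), self.k)  # keep only top-k experts
        weights = torch.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):  # only the selected experts do any work
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out

moe = ToyMoE()
print(moe(torch.randn(5, 64)).shape)  # torch.Size([5, 64])
</code></pre>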
<h2 id="heading-quantization-and-hardware-efficiency">Quantization and Hardware Efficiency</h2>
<p>GPT-OSS applies post-training quantization to MoE weights using the <strong>MXFP4</strong> format (4.25 bits per parameter). This reduces the memory footprint so that users do not need access to a large server farm to run inference. Specifically, this enables:</p>
<ul>
<li><p>GPT-OSS-120B to run on a single 80GB GPU</p>
</li>
<li><p>GPT-OSS-20B to run with only 16GB of memory</p>
</li>
</ul>
<p>Even with modest computing resources, users can run inference locally. Privacy-sensitive use cases, such as summarizing payslips, banking documents, and medical records, can be handled by the local model without any risk of data leaking over the internet. Without quantization, individual researchers, small companies, and even regular users would not be able to run a capable, chat-ready model locally.</p>
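<p>A quick back-of-the-envelope calculation shows why the 20B model fits comfortably in 16GB. This sketch counts only the quantized weights, ignoring activations and the KV cache.</p>
<pre><code class="lang-python"># Rough weight-memory estimate for gpt-oss-20b at 4.25 bits per parameter.
params = 20.9e9           # total parameters (most sit in the MoE layers)
bits_per_param = 4.25     # MXFP4: 4-bit values plus shared-scale overhead
gib = params * bits_per_param / 8 / 2**30
print(f"~{gib:.1f} GiB")  # ~10.3 GiB, leaving headroom within 16GB of memory
</code></pre>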
<h2 id="heading-tokenizer">Tokenizer</h2>
<p>GPT-OSS uses the <strong>o200k_harmony</strong> tokenizer, which ships with tiktoken, OpenAI’s open-source tokenizer library on GitHub. It is a Byte Pair Encoding (BPE) tokenizer extended with special tokens for role-based and channel-based chat formatting. BPE breaks text into small, reusable pieces so a language model can efficiently understand and generate words, even ones it has never seen before.</p>
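<p>As a quick illustration, the encoding can be loaded by name, assuming a tiktoken release recent enough to ship it:</p>
<pre><code class="lang-python">import tiktoken

# Load the o200k_harmony BPE encoding (requires a recent tiktoken version).
enc = tiktoken.get_encoding("o200k_harmony")
tokens = enc.encode("GPT-OSS tokenizes text into small, reusable pieces.")
print(tokens)              # a list of integer token ids
print(enc.decode(tokens))  # round-trips back to the original string
</code></pre>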
<h2 id="heading-pretraining">Pretraining</h2>
<h4 id="heading-data">Data</h4>
<p>GPT-OSS models are trained on a text-only dataset containing trillions of tokens with special emphasis on:</p>
<ul>
<li><p>STEM and mathematics</p>
</li>
<li><p>Programming and code</p>
</li>
<li><p>General knowledge</p>
</li>
</ul>
<p>Harmful content, such as material posing biosecurity risks, was filtered out using CBRN pretraining filters initially developed for GPT-4o.</p>
<h4 id="heading-training-infrastructure">Training Infrastructure</h4>
<p>Training was conducted on NVIDIA H100 GPUs using PyTorch with expert-optimized Triton kernels and FlashAttention:</p>
<ul>
<li><p><strong>GPT-OSS-120B:</strong> approximately 2.1M H100 GPU-hours</p>
</li>
<li><p><strong>GPT-OSS-20B:</strong> approximately 10x fewer GPU-hours</p>
</li>
</ul>
<h2 id="heading-post-training-and-reasoning">Post-Training and Reasoning</h2>
<p>After pretraining, the models were post-trained using chain-of-thought (CoT) reinforcement learning techniques, similar to how OpenAI’s o3 models were trained. The models are taught:</p>
<ul>
<li><p>To reason step by step</p>
</li>
<li><p>To solve complex math and coding problems</p>
</li>
<li><p>To use tools such as Python execution and web browsing</p>
</li>
</ul>
<h4 id="heading-harmony-chat-format">Harmony Chat Format</h4>
<p>GPT-OSS uses the harmony chat format, a role-based messaging structure with explicit message boundaries for roles like System, Developer, User, and Assistant. Imagine instructing a study partner helping you prepare for final exams to act only as the questioner while you act only as the answerer: each participant sticks to a fixed role. GPT-OSS is trained in a similar way, following the User-Assistant setting when chatting with a human user.</p>
<p>The format also introduces <strong>channels:</strong></p>
<ul>
<li><p><strong>analysis:</strong> internal chain-of-thought</p>
</li>
<li><p><strong>commentary:</strong> tool calls</p>
</li>
<li><p><strong>final:</strong> user-visible output</p>
</li>
</ul>
<p>This structure enables advanced agentic behaviors but requires careful handling, especially in multi-turn conversations. Keep in mind that the user and assistant roles must not be reversed; otherwise the application will not serve its purpose.</p>
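<p>In practice, developers rarely assemble role and channel markers by hand. A sketch, assuming the Hugging Face tokenizer for GPT-OSS ships a harmony chat template:</p>
<pre><code class="lang-python">from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")
messages = [
    {"role": "system", "content": "You are a helpful tutor."},
    {"role": "user", "content": "Quiz me on linear algebra."},
]
# The chat template emits the harmony special tokens and role boundaries for us.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
</code></pre>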
<h4 id="heading-variable-reasoning-effort">Variable Reasoning Effort</h4>
<p>In ChatGPT, one can choose among options like Auto, Instant, Thinking, Pro, and Deep Research. These help ChatGPT gauge how long the user is willing to wait for a response and how much chain-of-thought to apply to get the desired output. Similarly, GPT-OSS offers 3 different effort levels:</p>
<ul>
<li><p>low</p>
</li>
<li><p>medium</p>
</li>
<li><p>high</p>
</li>
</ul>
<p>These are set via system prompt keywords (e.g., <strong>Reasoning: high</strong>). Higher levels result in longer chain-of-thought and improved accuracy, at the cost of latency and compute.</p>
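<p>A simple experiment is to sweep the three levels and compare reply lengths. The sketch below reuses the <code>pipe</code> pipeline from the earlier snippet; the question is just a placeholder.</p>
<pre><code class="lang-python"># Hypothetical sweep over effort levels, reusing `pipe` from the earlier sketch.
question = "Prove that the sum of two odd numbers is even."
for effort in ["low", "medium", "high"]:
    messages = [
        {"role": "system", "content": f"Reasoning: {effort}"},
        {"role": "user", "content": question},
    ]
    reply = pipe(messages, max_new_tokens=2048)[0]["generated_text"][-1]["content"]
    print(f"{effort}: {len(reply.split())} words")  # higher effort, longer output
</code></pre>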
<h4 id="heading-agentic-tool-use">Agentic Tool Use</h4>
<p>A large language model with fewer parameters is more prone to hallucinations, so it benefits users that the model can interact with external tools. GPT-OSS models are trained to interact with:</p>
<ul>
<li><p>Web browsing via <strong>search</strong> and <strong>open</strong></p>
</li>
<li><p>Python code execution in a stateful Jupyter environment</p>
</li>
<li><p>Arbitrary developer-defined functions with schemas</p>
</li>
</ul>
<p>Tool use can be enabled or disabled through the system prompt.</p>
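<p>For developer-defined functions, here is a hedged sketch. It assumes a transformers version whose chat template accepts a <code>tools</code> argument; the weather function is purely hypothetical.</p>
<pre><code class="lang-python">from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")

def get_weather(city: str):
    """
    Return a short weather summary for a city.

    Args:
        city: The name of the city to look up.
    """
    return f"Sunny in {city}, 21 degrees"  # stubbed out for illustration

messages = [{"role": "user", "content": "What is the weather in Berlin?"}]
# The template turns the function signature and docstring into a tool schema
# that the model can call on the commentary channel.
prompt = tok.apply_chat_template(
    messages, tools=[get_weather], tokenize=False, add_generation_prompt=True
)
print(prompt)
</code></pre>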
<h2 id="heading-evaluation-results">Evaluation Results</h2>
<h4 id="heading-reasoning-and-coding">Reasoning and Coding</h4>
<p>While it is debatable which of Gemini, Claude, ChatGPT, and the other public chat models performs best on complex problems, GPT-OSS is, among locally runnable models, particularly strong in mathematics and reasoning-heavy tasks. For example, GPT-OSS-20B uses over 20k chain-of-thought tokens per AIME problem on average. Both models perform strongly on coding and tool-use benchmarks, with GPT-OSS-120B approaching the performance of OpenAI’s o4-mini.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769036445897/d6189b25-4c5c-4655-92e0-d1d6664eb265.png" alt class="image--center mx-auto" /></p>
<h4 id="heading-health-performance">Health Performance</h4>
<p>GPT-OSS models perform competitively on HealthBench evaluations.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769035826187/175341f9-ffc8-4790-9ca4-2e7b7ba1b172.png" alt class="image--center mx-auto" /></p>
<p>Notably, GPT-OSS-120B approaches OpenAI o3 performance and outperforms several frontier closed models. However, it is important to note that the advice from a large language model is not intended to replace medical professionals.</p>
<h4 id="heading-multilingual-performance">Multilingual Performance</h4>
<p>On MMMLU benchmarks across 14 languages, GPT-OSS-120B at a high reasoning effort setting comes close to o4-mini-high performance.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769036019896/24ef00ce-5fad-4245-8f96-cf51a87f9bf7.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-safety-and-limitations">Safety and Limitations</h2>
<p>Even the best systems have limitations, and GPT-OSS models are no exception. As open-weight models, they cannot offer the same safety guarantees as OpenAI’s proprietary counterparts, and they have other limitations as well.</p>
<h4 id="heading-preparedness-framework">Preparedness Framework</h4>
<p>GPT-OSS-120B does not reach OpenAI’s “High capability” thresholds under the Preparedness Framework, even after adversarial fine-tuning, in any of the three tracked categories:</p>
<ul>
<li><p>Biological and chemical capability</p>
</li>
<li><p>Cyber capability</p>
</li>
<li><p>AI self-improvement</p>
</li>
</ul>
<h4 id="heading-disallowed-content-and-jailbreaks">Disallowed Content and Jailbreaks</h4>
<p>On standard disallowed content evaluations, both models perform on par with OpenAI o4-mini. GPT-OSS-20B slightly underperforms on illicit/violent categories but still exceeds GPT-4o. Robustness to jailbreaks using the StrongReject benchmark is comparable to o4-mini.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769036374502/35b6ebd3-c7fe-4964-82fd-ca56b4d9f3a3.png" alt class="image--center mx-auto" /></p>
<h4 id="heading-instruction-hierarchy">Instruction Hierarchy</h4>
<p>Any public chat LLM must follow the instruction hierarchy to behave like a chatbot at all; otherwise it would fail at the bare minimum: chat. GPT-OSS follows system and developer message priorities, but it generally underperforms o4-mini on instruction hierarchy tests such as system prompt extraction and prompt injection hijacking.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769036661898/20ceb3df-a6e5-4b42-90e2-688c4892ccfd.png" alt class="image--center mx-auto" /></p>
<h4 id="heading-chain-of-thought-and-hallucinations">Chain-of-Thought and Hallucinations</h4>
<p>GPT-OSS chains of thought are intentionally left unrestricted; as a result:</p>
<ul>
<li><p>CoT may contain hallucinations</p>
</li>
<li><p>CoT may include unsafe or unfiltered language</p>
</li>
</ul>
<p>Researchers found that when models are explicitly discouraged from expressing certain thoughts, they can instead learn to conceal their reasoning while continuing to behave incorrectly. Developers are advised not to expose raw chain-of-thought to end users, as it may contain hallucinated or unfiltered language that is inappropriate for direct display.</p>
<h4 id="heading-fairness-and-bias">Fairness and Bias</h4>
<p>On the BBQ fairness benchmarks, GPT-OSS models perform at roughly the same level as OpenAI o4-mini.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769037084369/f89d26fc-827b-4b39-94f4-33332b19be27.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>GPT-OSS represents a significant step forward for open-weight reasoning models. It combines long CoT reasoning, agentic tool use, and strong math and coding performance in a form developers can fully control. Training an LLM from scratch is technically infeasible for most individuals because of the massive compute requirement, unless one owns a supercomputer at home or spends a fortune on cloud infrastructure. Technical freedom and openness shift the burden of responsibility: OpenAI’s model card and license clearly list GPT-OSS’s shortcomings and where it may fail. Like many technologies, a car can serve as a means of transportation or, in the hands of a criminal, as a lethal machine; an LLM likewise has to be used responsibly, especially once centralized restrictions are lifted. Safety, deployment, and monitoring are no longer handled by a centralized API but by the system builder. For teams building agentic systems, GPT-OSS is not a drop-in replacement for ChatGPT, but it is a powerful foundation.</p>
<h2 id="heading-references">References</h2>
<p>[1] OpenAI <em>et al.</em>, “gpt-oss-120b &amp; gpt-oss-20b Model Card,” <em>arXiv</em>:2508.10925, Aug. 2025. doi: <a target="_blank" href="https://doi.org/10.48550/arXiv.2508.10925">10.48550/arXiv.2508.10925</a>.</p>
<details><summary>Disclaimer</summary><div data-type="detailsContent">All images and tables are reproduced from [1]. Cover Image is obtained from <a target="_self" href="https://raw.githubusercontent.com/openai/gpt-oss/main/docs/gpt-oss-120b.svg">HuggingFace</a>. This article is an independent technical summary and interpretation of the OpenAI GPT-OSS model card. The author is not affiliated with OpenAI, and any opinions or interpretations are solely those of the author.</div></details>]]></content:encoded></item><item><title><![CDATA[Why Variational Autoencoders Are the Secret Sauce of the GenAI Revolution]]></title><description><![CDATA[Introduction: The GenAI Age — Creating Instead of Curating
Generative AI (GenAI) is transforming the landscape of technology, powering applications which can generate audio, art, or photorealistic images. We have seen a recent trend of Ghibli images ...]]></description><link>https://blogs.sagnikdas.com/variational-autoencoders-secret-sauce-genai</link><guid isPermaLink="true">https://blogs.sagnikdas.com/variational-autoencoders-secret-sauce-genai</guid><category><![CDATA[Artificial Intelligence]]></category><category><![CDATA[Machine Learning]]></category><dc:creator><![CDATA[Sagnik Das]]></dc:creator><pubDate>Thu, 31 Jul 2025 10:11:48 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1753956573930/d24d050d-f737-4d17-ba58-dcbed75e2d0f.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-introduction-the-genai-age-creating-instead-of-curating">Introduction: The GenAI Age — Creating Instead of Curating</h1>
<p>Generative AI (GenAI) is transforming the landscape of technology, powering applications that can generate audio, art, or photorealistic images. We recently saw Ghibli-style images trend throughout the world as millions of users uploaded their photos to ChatGPT’s DALL-E to produce cartoon-like versions. Although advanced image generation models now use diffusion techniques instead of Variational Autoencoders, it is interesting to understand how it all started. Generating a new article, piece of music, or drawing is particularly challenging even for humans, as the human mind works best at classification (e.g. distinguishing between two or more objects). Machines face the same challenge: while classification tasks are comparatively easy for computational engines, generation requires more intricate techniques and, of course, lots of computational power. Images are high-dimensional representations of real-world scenes, and generating new images would not be possible if the machine could not handle that dimensionality. Variational Autoencoders, a foundational model family, were introduced to solve this problem.</p>
<h1 id="heading-what-is-a-variational-auto-encoder-vae">What is a Variational Auto Encoder (VAE)?</h1>
<p>A VAE learns to encode real-world data (e.g. images, text, speech) into a compressed latent space before reconstructing or generating new data points from this space.</p>
<h2 id="heading-how-does-it-work">How does it work?</h2>
<ul>
<li><p><strong>Encoder:</strong> Projects input x into a lower-dimensional latent variable z, which represents the original input.</p>
</li>
<li><p><strong>Latent Space:</strong> A learned, lower-dimensional, structured representation in which complex data, such as images or sounds, are encoded as abstract, meaningful features that capture the underlying variations in the data.</p>
</li>
<li><p><strong>Decoder:</strong> Regenerates data from sampled latent representations, which means the network can create new similar data, not just reconstruct the input.</p>
</li>
</ul>
<p>$$\text{Dimension of Input Variable:} \quad x \in \mathbb{R}^D$$</p><p>$$\text{Dimension of Latent Variable:} \quad z \in \mathbb{R}^K$$</p><p>$$\text{where} \quad K \ll D$$</p><p><img src="https://upload.wikimedia.org/wikipedia/commons/4/4a/VAE_Basic.png" alt="Auto Encoder Image" class="image--center mx-auto" /></p>
<p><em>Image by EugenioTL - Own work, CC BY-SA 4.0,</em> <a target="_blank" href="https://commons.wikimedia.org/w/index.php?curid=107231101"><em>https://commons.wikimedia.org/w/index.php?curid=107231101</em></a></p>
<h2 id="heading-math-behind-the-magic">Math Behind the Magic</h2>
<p>The core of Variational Autoencoders (VAEs) is the optimisation of the Evidence Lower Bound (ELBO), which balances two goals:</p>
<ul>
<li><p>Accurately reconstructing data from the latent space.</p>
</li>
<li><p>Ensuring the latent encodings follow a standard, well-behaved distribution (typically Gaussian).</p>
</li>
</ul>
<p>$$\mathcal{L}(x_n, z_n, \psi, \theta^x) = -\log p(x_n | z_n, \theta^x) + D_{KL}\big( q(z_n|x_n, \psi) \;\|\; p(z_n) \big)$$</p><p>where:</p>
<p>$$p(x_n | z_n, \theta^x) \; \text{is the likelihood term for reconstructing the data}$$</p><p>$$q(z_n|x_n, \psi) \; \text{is the approximate posterior from the encoder}$$</p><p>$$p(z_n) \; \text{is the prior over latent variables (usually a standard normal).}$$</p><p>$$D_{KL}\big( q(z_n|x_n, \psi) \;\|\; p(z_n) \big) \; \text{is the Kullback-Leibler (KL) divergence, regularizing the latent space.}$$</p><p>The <strong>reparameterization trick</strong> allows gradients to flow through stochastic nodes, making training possible (crucial for GenAI scalability).</p>
<h2 id="heading-core-intuition">Core Intuition</h2>
<p>So far, we have defined the loss function of the Variational Autoencoder for training the model to reconstruct (or generate) new data samples. Minimising the loss would lead to the output of the decoder being close to the actual image input. Using techniques such as Backpropagation and Stochastic Gradient Descent, we compute the gradient of the loss function with respect to all network parameters by applying the chain rule backwards through the network layers — from the decoder output, through the latent space, back to the encoder input. Then, we update network parameters to minimise the loss function.</p>
<p>Complete ELBO Loss is a combination of two parts:</p>
<ul>
<li><p><strong>Reconstruction Loss:</strong> Measures how well the decoder can recreate the original input from the latent representation, typically using mean squared error or binary cross-entropy.</p>
</li>
<li><p><strong>KL Divergence Loss:</strong> Regularises the encoder’s latent distribution to stay close to a standard normal distribution, ensuring the latent space remains well-structured and enabling smooth generation.</p>
</li>
</ul>
<h2 id="heading-inference-phase-from-latent-code-to-new-data">Inference Phase: From Latent Code to New Data</h2>
<p>After training is completed, we remove the encoder block from the scenario and now rely on the decoder block to generate new data samples (e.g. images). Unlike training (which encodes real data into latent space), inference starts from the latent space and generates entirely new outputs.</p>
<p>VAE inference follows a straightforward three-step process:</p>
<ul>
<li><strong>Sample from the prior</strong>: Draw a random latent code from the learned prior distribution (typically standard normal).</li>
</ul>
<p>$$z' \sim p(z|\theta^z)$$</p><ul>
<li><strong>Compute decoder parameters</strong>: Use the decoder network to map the latent code to output distribution parameters.</li>
</ul>
<p>$$\hat{\theta}^x = f(z'; \theta^x)$$</p><ul>
<li><strong>Generate the output</strong>: Sample the final output from the parameterised distribution.</li>
</ul>
<p>$$x \sim p(x|\hat{\theta}^x)$$</p><h2 id="heading-the-elbo-reparameterization-making-vae-training-possible">The ELBO Reparameterization: Making VAE Training Possible</h2>
<p>While the ELBO objective we discussed earlier provides the conceptual framework for VAE training, there’s a critical mathematical challenge: <em>how do you backpropagate through random sampling</em>? The answer lies in an elegant mathematical trick called the <strong>reparameterization trick</strong>.</p>
<p>$$z_n = h(x_n; \psi)_\mu + \epsilon_n \, h(x_n; \psi)_\sigma$$</p><p>Instead of sampling z directly from the encoder’s output distribution, we express it as above.</p>
<p>where:</p>
<p>$$h(x_n; \psi)_\mu \; \text{and} \; h(x_n; \psi)_\sigma \; \text{are the mean and standard deviation output by the encoder}$$</p><p>$$\epsilon_n \sim \mathcal{N}(0,1) \; \text{is noise sampled from a standard normal distribution}$$</p><h3 id="heading-this-reparameterization-transforms-the-optimisation-problem-into">This reparameterization transforms the optimisation problem into</h3>
<p>$$\arg\min_{\psi,\theta^x} \; \mathbb{E}_{x_n,\epsilon_n} \left[ -\log \mathcal{N}\big(x_n \mid f(z_n; \theta^x)_\mu, f(z_n; \theta^x)_\sigma\big) + D_{KL}\big( q(z_n|x_n, \psi) \;\|\; \mathcal{N}(z_n|0,1) \big) \right]$$</p><p>The brilliant insight is that gradients can now flow through the <strong>deterministic functions</strong></p>
<p>$$h(x_n; \psi)_\mu \; \text{and} \; h(x_n; \psi)_\sigma$$</p><p>$$\text{while the randomness is isolated in} \; \epsilon_n$$</p><p><img src="https://www.researchgate.net/publication/374693185/figure/fig1/AS:11431281198179610@1697206307688/Simplified-schematic-of-the-VAE-variational-autoencoder-VAE-models-Fig-1-are.png" alt="Simplified schematic of the VAE variational autoencoder VAE models... |  Download Scientific Diagram" /></p>
<p><em>Source: Puchalski, Andrzej &amp; Komorska, Iwona. (2023). Generative modelling of vibration signals in machine maintenance. Eksploatacja i Niezawodnosc - Maintenance and Reliability. 25. 10.17531/ein/173488.</em></p>
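<p>In code, the trick is only a couple of lines. The sketch below assumes the encoder outputs a mean and a log-variance vector for each input.</p>
<pre><code class="lang-python">import torch

def reparameterize(mu, log_var):
    """Sample z = mu + eps * sigma, with all randomness isolated in eps."""
    eps = torch.randn_like(mu)  # eps ~ N(0, I), independent of the parameters
    return mu + eps * torch.exp(0.5 * log_var)  # gradients flow through mu, log_var

z = reparameterize(torch.zeros(4, 16), torch.zeros(4, 16))  # 4 samples, 16-dim latent
</code></pre>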
<h2 id="heading-closed-form-kl-divergence">Closed Form KL Divergence</h2>
<p>KL divergence can be computed analytically when dealing with Gaussian distributions:</p>
<p>$$D_{KL}\big( q(z_n|x_n, \psi) \;\|\; \mathcal{N}(0,1) \big) \propto -\log h(x_n; \psi)_\sigma + \frac{1}{2}h(x_n; \psi)_\sigma^2 + \frac{1}{2}h(x_n; \psi)_\mu^2$$</p><p>Without the reparameterization trick, VAE training would be impossible. It’s the mathematical bridge that allows us to:</p>
<ul>
<li><p>Maintain the probabilistic nature of the latent space</p>
</li>
<li><p>Enable gradient-based optimisation</p>
</li>
<li><p>Scale VAE training to complex, high-dimensional data</p>
</li>
</ul>
<h2 id="heading-simplified-loss">Simplified Loss</h2>
<p>While the full VAE formulation with probabilistic outputs is mathematically elegant, <strong>practitioners often use a simplified version</strong> that’s easier to implement and train while maintaining most of the benefits.</p>
<p>Instead of the full probabilistic framework where we sample from the decoder output distribution, we can directly use the mean as our reconstruction.</p>
<p>$$\hat{x} = f(z_n; \theta^x)_\mu$$</p><p>This omits the variance term, transforming our complex probabilistic generation into a straightforward deterministic reconstruction.</p>
<h3 id="heading-simplified-loss-function">Simplified Loss Function</h3>
<p>$$\arg\min_{\psi,\theta^x} \; \mathbb{E}_{x_n,z_n} \left[ \sum_{d=1}^{D} \big(x_{n,d} - f(z_n; \theta^x)_d\big)^2 + \sum_{k=1}^{K} \left( -\log h(x_n; \psi)_{\sigma_k} + \frac{1}{2}h(x_n; \psi)_{\sigma_k}^2 + \frac{1}{2}h(x_n; \psi)_{\mu_k}^2 \right) \right]$$</p><p>where:</p>
<ul>
<li><p>The first term is a simple mean squared error (MSE) between the input and the reconstruction.</p>
</li>
<li><p>The second term is the analytical KL divergence regularising the latent space.</p>
</li>
</ul>
<h3 id="heading-why-this-works-in-practice">Why this works in practice</h3>
<p>This simplified approach offers several advantages:</p>
<ul>
<li><p><strong>Computational Efficiency</strong>: No need to sample from the decoder distribution during training, reducing computational overhead.</p>
</li>
<li><p><strong>Implementation Simplicity</strong>: The loss becomes a straightforward combination of reconstruction error and KL regularisation, much easier to code and debug.</p>
</li>
<li><p><strong>Stable Training</strong>: Removing the stochastic sampling from the decoder output often leads to more stable gradients and faster convergence.</p>
</li>
<li><p><strong>Retained Generative Power</strong>: The latent space still maintains its probabilistic structure through the encoder, preserving the model’s ability to generate diverse samples</p>
</li>
</ul>
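<p>A minimal PyTorch sketch of this simplified objective, assuming the encoder emits a mean and a log-variance and using MSE for the reconstruction term:</p>
<pre><code class="lang-python">import torch
import torch.nn.functional as F

def vae_loss(x, x_hat, mu, log_var):
    """Simplified ELBO loss: MSE reconstruction plus closed-form Gaussian KL."""
    recon = F.mse_loss(x_hat, x, reduction="sum")
    # Closed-form KL between N(mu, sigma^2) and N(0, 1), summed over dimensions.
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl
</code></pre>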
<h1 id="heading-vae-algorithm">VAE Algorithm</h1>
<p>We have covered the most important mathematical aspects in the previous sections; now, let’s explore how we can implement this programmatically.</p>
<pre><code class="lang-plaintext">Data:
    •    D: Dataset
    •    q_φ(z|x): Inference model
    •    p_θ(x, z): Generative model
Result:
    •    θ, φ: Learned parameters
Algorithm:
(θ, φ) ← Initialize parameters
while SGD not converged do
    M ~ D (Random minibatch of data) 
    ε ~ p(ε) (Random noise for every datapoint in M)
    Compute L_θ,φ(M, ε) and its gradients ∇_θ,φ L_θ,φ(M, ε)
    Update θ and φ using SGD optimizer
end
</code></pre>
<p><em>Source: D. P. Kingma and M. Welling, “An Introduction to Variational Autoencoders,” FNT in Machine Learning, vol. 12, no. 4, pp. 307–392, 2019, doi:</em> <a target="_blank" href="https://doi.org/10.1561/2200000056"><em>10.1561/2200000056</em></a><em>.</em></p>
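<p>The pseudocode above maps directly onto a few dozen lines of PyTorch. The sketch below trains a toy VAE on random data standing in for a real dataset; the layer sizes, latent dimension, and optimiser settings are illustrative. The last line also demonstrates the inference procedure from earlier: sample z from the prior and decode.</p>
<pre><code class="lang-python">import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, d_in=784, d_latent=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_in, 256), nn.ReLU())
        self.enc_mu = nn.Linear(256, d_latent)      # encoder mean head
        self.enc_logvar = nn.Linear(256, d_latent)  # encoder log-variance head
        self.dec = nn.Sequential(
            nn.Linear(d_latent, 256), nn.ReLU(), nn.Linear(256, d_in), nn.Sigmoid()
        )

    def forward(self, x):
        h = self.enc(x)
        mu, log_var = self.enc_mu(h), self.enc_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)  # reparameterize
        return self.dec(z), mu, log_var

model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
data = torch.rand(512, 784)  # toy stand-in for a real dataset

for step in range(100):  # "while SGD not converged"
    x = data[torch.randint(0, 512, (64,))]  # random minibatch M
    x_hat, mu, log_var = model(x)
    recon = F.mse_loss(x_hat, x, reduction="sum")  # reconstruction term
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())  # KL term
    loss = recon + kl
    opt.zero_grad()
    loss.backward()
    opt.step()  # update theta and phi

# Inference: sample a latent code from the prior and decode it into new data.
x_new = model.dec(torch.randn(1, 16))
</code></pre>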
<h1 id="heading-conclusionhttpsdoiorg1015612200000056">Con<a target="_blank" href="https://doi.org/10.1561/2200000056">clusion</a></h1>
<p><a target="_blank" href="https://doi.org/10.1561/2200000056">Variational Au</a>toencoders represent far more than just another neural network architecture—they embody a <strong>fundamental shift in how machines understand and create</strong>. By learning to compress the essence of data into structured latent spaces and probabilistically generate new samples, VAEs laid the mathematical groundwork for the entire GenAI revolution we’re witnessing today. While modern systems like ChatGPT, DALL-E, and Stable Diffusion have moved toward diffusion models and transformers for their impressive results, <strong>they all build upon the core insights pioneered by VAEs</strong>: the power of latent space representations, the importance of probabilistic generation, and the elegant mathematics of variational inference.</p>
<h2 id="heading-key-takeaways">Key Takeaways</h2>
<ul>
<li><p><strong>VAEs solved the fundamental challenge</strong> of teaching machines to generate rather than just classify.</p>
</li>
<li><p><strong>The ELBO objective and reparameterization trick</strong> remain essential concepts in modern generative modelling.</p>
</li>
<li><p><strong>Probabilistic latent spaces</strong> enable the diversity and creativity we see in today’s GenAI applications.</p>
</li>
<li><p><strong>The mathematical principles behind VAEs</strong> continue to influence cutting-edge research in generative AI.</p>
</li>
</ul>
<h2 id="heading-the-road-ahead">The Road Ahead</h2>
<p>As we stand at the forefront of the GenAI era, understanding VAEs isn’t just about historical appreciation—it’s about grasping the mathematical DNA of machine creativity. Whether you’re a researcher pushing the boundaries of generative modelling, a developer implementing creative AI solutions, or simply someone fascinated by how machines learn to imagine, the principles embedded in VAEs will continue to guide the future of artificial creativity.</p>
<p>The next time you marvel at an AI-generated image or piece of music, remember: <em>it all started with the elegant mathematics of variational autoencoders—<strong>teaching machines not just to think, but to dream</strong>.</em>
</p>]]></content:encoded></item><item><title><![CDATA[Your 92% Accurate AI Model Might Be Dangerous (Here's Why)]]></title><description><![CDATA[The Problem We're Not Talking About
Your deep learning model can detect cancer from MRI scans with 92% accuracy. Impressive, right? But here's the uncomfortable question: Should doctors actually trust it?
As AI engineers, we often celebrate high accu...]]></description><link>https://blogs.sagnikdas.com/your-92-accurate-ai-model-might-be-dangerous-heres-why</link><guid isPermaLink="true">https://blogs.sagnikdas.com/your-92-accurate-ai-model-might-be-dangerous-heres-why</guid><category><![CDATA[Artificial Intelligence]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Deep Learning]]></category><dc:creator><![CDATA[Sagnik Das]]></dc:creator><pubDate>Wed, 16 Jul 2025 22:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1753728063255/a79d39fc-6882-4b8d-904a-9e840a79722d.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-the-problem-were-not-talking-about">The Problem We're Not Talking About</h2>
<p>Your deep learning model can detect cancer from MRI scans with 92% accuracy. Impressive, right? But here's the uncomfortable question: <strong>Should doctors actually trust it?</strong></p>
<p>As AI engineers, we often celebrate high accuracy scores as the ultimate win. But when I dug deeper into the epistemology (fancy word for "how we know what we know") of AI in medicine, I realized we might be solving the wrong problem entirely.</p>
<h2 id="heading-the-black-box-dilemma">The Black Box Dilemma</h2>
<p>Let's be honest about what we've built. Modern neural networks are essentially:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Oversimplified, but you get the idea</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">diagnose_cancer</span>(<span class="hljs-params">mri_scan</span>):</span>
    <span class="hljs-comment"># 50+ layers of transformations</span>
    <span class="hljs-comment"># Billions of parameters</span>
    <span class="hljs-comment"># Gradient descent magic</span>
    <span class="hljs-keyword">return</span> <span class="hljs-string">"cancer_probability: 0.92"</span>
</code></pre>
<p>The problem? Even we don't really understand how this function works internally. Sure, we can trace the math, but can you explain to a doctor <em>why</em> the model flagged this specific scan?</p>
<p>This isn't just a UX problem—it's an <strong>epistemic crisis</strong>. In medicine, being right isn't enough. You need to be right <em>for the right reasons</em>.</p>
<h2 id="heading-why-trust-me-it-works-isnt-enough">Why "Trust Me, It Works" Isn't Enough</h2>
<p>Imagine you're a doctor. An AI system tells you a patient has cancer, but you can't explain why. The patient asks: "How do you know?"</p>
<p>Your options:</p>
<ol>
<li><p>"The computer said so, and it's usually right"</p>
</li>
<li><p>"I can see suspicious tissue patterns in regions X and Y that typically indicate..."</p>
</li>
</ol>
<p>Option 1 might work for recommending movies, but it's ethically and epistemologically bankrupt in medicine.</p>
<h3 id="heading-the-reliability-trap">The Reliability Trap</h3>
<p>Some argue for <strong>computational reliabilism</strong>—basically, "if it works consistently, we should trust it." This sounds reasonable until you consider:</p>
<ul>
<li><p><strong>Edge cases</strong>: What happens when the model encounters something outside its training distribution?</p>
</li>
<li><p><strong>Bias amplification</strong>: High accuracy on your test set might hide systematic biases</p>
</li>
<li><p><strong>Accountability</strong>: When the model fails, who's responsible?</p>
</li>
</ul>
<pre><code class="lang-python"><span class="hljs-comment"># This is what we often do</span>
<span class="hljs-keyword">if</span> model_accuracy &gt; <span class="hljs-number">0.9</span>:
    deploy_to_production()

<span class="hljs-comment"># This is what we should consider</span>
<span class="hljs-keyword">if</span> model_accuracy &gt; <span class="hljs-number">0.9</span> <span class="hljs-keyword">and</span> model_is_interpretable() <span class="hljs-keyword">and</span> bias_tested():
    deploy_to_production()
</code></pre>
<h2 id="heading-what-this-means-for-ai-engineers">What This Means for AI Engineers</h2>
<p>If you're building AI systems for healthcare (or any high-stakes domain), here's what to consider:</p>
<h3 id="heading-1-build-in-explainability-from-day-one">1. Build in Explainability from Day One</h3>
<p>Don't treat interpretability as an afterthought.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Example with SHAP</span>
<span class="hljs-keyword">import</span> shap

explainer = shap.Explainer(model)
shap_values = explainer(X_test)

<span class="hljs-comment"># Now you can show which features drove the decision</span>
shap.plots.waterfall(shap_values[<span class="hljs-number">0</span>])
</code></pre>
<h3 id="heading-2-design-for-epistemic-transparency">2. Design for Epistemic Transparency</h3>
<p>Create systems where:</p>
<ul>
<li><p><strong>Confidence intervals</strong> are meaningful and well-calibrated</p>
</li>
<li><p><strong>Feature importance</strong> is interpretable to domain experts</p>
</li>
<li><p><strong>Decision boundaries</strong> can be explained in domain terms</p>
</li>
<li><p><strong>Uncertainty quantification</strong> is built-in, not bolted-on</p>
</li>
</ul>
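<p>As a concrete starting point for the first item, here is a quick calibration check with scikit-learn; the synthetic dataset is just a stand-in to keep the sketch runnable.</p>
<pre><code class="lang-python">from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real dataset, just to keep the sketch self-contained.
X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)

# Compare predicted probabilities with observed frequencies, bin by bin.
prob_true, prob_pred = calibration_curve(
    y_test, clf.predict_proba(X_test)[:, 1], n_bins=10
)
print(list(zip(prob_pred.round(2), prob_true.round(2))))  # calibrated: pairs match
</code></pre>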
<h3 id="heading-3-collaborate-with-domain-experts">3. Collaborate with Domain Experts</h3>
<p>Your model might be mathematically sound, but does it make medical sense? Partner with doctors to:</p>
<ul>
<li><p>Validate that important features align with medical knowledge</p>
</li>
<li><p>Identify potential failure modes</p>
</li>
<li><p>Ensure explanations are clinically meaningful</p>
</li>
</ul>
<h2 id="heading-the-bigger-picture">The Bigger Picture</h2>
<p>This isn't just about medical AI—it's about <strong>responsible AI development</strong>. As we build increasingly powerful systems, we need to ask:</p>
<ul>
<li><p>Are we optimizing for the right metrics?</p>
</li>
<li><p>Can we explain our models' decisions to stakeholders?</p>
</li>
<li><p>Are we building trust or just demanding it?</p>
</li>
</ul>
<h2 id="heading-practical-takeaways">Practical Takeaways</h2>
<ol>
<li><p><strong>Accuracy is necessary but not sufficient</strong> for high-stakes AI</p>
</li>
<li><p><strong>Interpretability should be a first-class requirement</strong>, not a nice-to-have</p>
</li>
<li><p><strong>Domain expertise is irreplaceable</strong>—collaborate, don't replace</p>
</li>
<li><p><strong>Epistemic humility</strong> is crucial—know what your model doesn't know</p>
</li>
</ol>
<h2 id="heading-final-thoughts">Final Thoughts</h2>
<p>The next time you see that 92% accuracy score, ask yourself: "Would I trust this system to make decisions about my health?" If the answer is no, you've got more work to do.</p>
<p>Building AI that's not just accurate but genuinely trustworthy is one of the most important challenges in tech today. It's not just about better algorithms—it's about building systems that deserve the trust we're asking for.</p>
<hr />
<p><em>What are your thoughts on explainable AI? Have you worked on interpretability in high-stakes domains?</em></p>
]]></content:encoded></item></channel></rss>