OpenAI Explains the “Goblin” Glitch in Its AI Models


OpenAI has addressed a peculiar quirk in its artificial intelligence systems: an unexpected tendency to reference goblins, gremlins, raccoons, and other whimsical creatures. After a Wired report highlighted instructions in OpenAI’s coding model explicitly telling it to avoid these topics, the company published a detailed explanation on its website. It describes the phenomenon not as a bug in the traditional sense, but as a “strange habit” developed during the training process.

The Origin of the Quirk

The issue first emerged with the release of the GPT-5.1 model, specifically when users engaged the “Nerdy” personality setting. Initially, these references appeared as metaphors or stylistic choices within that specific mode. However, the problem intensified in subsequent model releases.

OpenAI discovered that its reinforcement learning process inadvertently rewarded these quirky metaphors. Because reinforcement learning does not strictly confine learned behaviors to the specific conditions that generated them, the “goblin” style spread. Once a specific output style is rewarded, it can bleed into other areas of the model’s behavior, especially when those outputs are reused in supervised fine-tuning or preference data.
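To make that mechanism concrete, the bleed-through can be sketched as a toy policy-gradient setup. The Python example below is a minimal illustration, not OpenAI’s training stack, and every name in it is invented: a style reward granted only in one context ends up shifting another context, because both contexts share a parameter.

```python
import math
import random

random.seed(0)

# Toy illustration, NOT OpenAI's training stack: a softmax "policy" picks a
# style per context. The style bias is shared across contexts, so a reward
# earned only in the "nerdy" context also shifts the "coding" context.
CONTEXTS = ["nerdy", "coding"]
STYLES = ["plain", "goblin"]

shared_bias = {s: 0.0 for s in STYLES}                        # shared everywhere
ctx_weight = {(c, s): 0.0 for c in CONTEXTS for s in STYLES}  # per-context

def probs(ctx):
    logits = [shared_bias[s] + ctx_weight[(ctx, s)] for s in STYLES]
    z = sum(math.exp(l) for l in logits)
    return [math.exp(l) / z for l in logits]

def reward(ctx, style):
    # Only the "nerdy" context ever rewards the quirky style.
    return 1.0 if (ctx == "nerdy" and style == "goblin") else 0.0

LR = 0.5
for _ in range(2000):
    ctx = random.choice(CONTEXTS)
    p = probs(ctx)
    i = random.choices(range(len(STYLES)), weights=p)[0]
    r = reward(ctx, STYLES[i])
    # REINFORCE update: d(log pi)/d(logit_j) = 1[j == i] - p[j].
    for j, s in enumerate(STYLES):
        g = LR * r * ((1.0 if j == i else 0.0) - p[j])
        shared_bias[s] += g          # this shared term leaks across contexts
        ctx_weight[(ctx, s)] += g

for c in CONTEXTS:
    print(c, dict(zip(STYLES, [round(x, 2) for x in probs(c)])))
# "coding" ends up favoring "goblin" even though it was never rewarded there.
```

Running the sketch, the “coding” context comes to prefer the quirky style even though no reward was ever issued there, mirroring how a habit learned under one personality setting can surface elsewhere.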

Why the Instructions Persisted

Although OpenAI discontinued the “Nerdy” personality in March—which significantly reduced the frequency of these references—the issue did not vanish entirely. The GPT-5.5 model, used within the Codex coding tool, still exhibited the behavior.

This persistence occurred because training for GPT-5.5 began before the root cause of the glitch was identified, so the model retained some of the learned tendencies. To mitigate this, OpenAI added explicit instructions to Codex suppressing references to whimsical creatures.
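Wired’s reporting described the mitigation as plain-language rules placed in Codex’s prompt. The exact wording is not public, so the sketch below is a hypothetical reconstruction of the pattern using the OpenAI Python SDK; the model name and instruction text are assumptions, not the real prompt.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical system-level rule in the spirit of the reported mitigation;
# the actual Codex instructions are not public.
SUPPRESSION_RULE = (
    "Do not use metaphors or asides involving goblins, gremlins, "
    "or other fanciful creatures. Keep explanations literal."
)

response = client.chat.completions.create(
    model="gpt-5.5-codex",  # illustrative model name, not confirmed
    messages=[
        {"role": "system", "content": SUPPRESSION_RULE},
        {"role": "user", "content": "Explain why this build step failed."},
    ],
)
print(response.choices[0].message.content)
```

The design choice worth noting is that the fix lives at the prompt layer rather than in the weights: with GPT-5.5’s training already underway before the root cause was found, a system-level instruction is the cheaper patch.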

Key Insight: The “goblin” references were not hardcoded but emerged from the model’s learning dynamics, demonstrating how reinforcement signals can create unintended stylistic tics that persist across model iterations.

A Note on Customization

For users who find the goblin-free output too sterile, OpenAI provides a way to override these specific instructions, restoring a more playful, if unconventional, interaction style.
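The company has not spelled out the override mechanism, so the following is purely hypothetical: mirroring the suppression sketch above, an opt-back-in could be a user-supplied instruction that countermands the default rule.

```python
# Hypothetical opt-back-in, mirroring the suppression sketch above; whether
# this matches OpenAI's sanctioned override wording is an assumption.
OVERRIDE_RULE = (
    "Playful metaphors are welcome: goblins, gremlins, and raccoons may "
    "appear in explanations, so long as the code itself stays precise."
)

messages = [
    {"role": "system", "content": OVERRIDE_RULE},
    {"role": "user", "content": "Walk me through this regex."},
]
```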

Conclusion

The “goblin” incident highlights the complexity of aligning AI models with human expectations. It underscores how subtle reward signals during training can produce unexpected behavioral patterns, requiring careful monitoring and targeted interventions to maintain the desired output style.