OpenAI's models, beginning with GPT-5.1, started referencing goblins and other fantastical creatures with unusual frequency, a quirk traced to unintended reinforcement from the "Nerdy" personality feature, which rewarded playful language. The behavior compounded across subsequent model versions, producing a marked rise in creature-related metaphors. In response, OpenAI retired the "Nerdy" personality and adjusted its training protocols to suppress the behavior.
The key takeaway is how strongly, and how unexpectedly, reward signals can shape model behavior, as the persistent "goblin" metaphors in GPT-5.1 and its successors demonstrate. Reward systems must be designed and monitored carefully during training, because a misspecified signal can seed an unintended behavior that then propagates and amplifies through reinforcement learning feedback loops. Correcting such issues at the root cause requires robust tooling and methodology for auditing model behavior, not just surface-level patches.
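The dynamic described above can be illustrated with a toy model. The sketch below is hypothetical and not based on any actual OpenAI training code: a one-parameter policy chooses between emitting a "quirky" token or a neutral one, and a proxy reward (standing in for the "playful language" signal) scores the quirky token slightly higher. A plain REINFORCE update then drives the policy toward emitting the quirky token almost exclusively, even though no one explicitly asked for that behavior.

```python
import math
import random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical proxy reward: "playful" output scores higher than neutral output.
# Action 1 = emit the quirky token, action 0 = emit a neutral token.
REWARD = {1: 1.0, 0: 0.2}

theta = 0.0  # single policy logit; sigmoid(theta) = P(emit quirky token)
lr = 0.5     # learning rate

for step in range(2000):
    p = sigmoid(theta)
    action = 1 if random.random() < p else 0
    r = REWARD[action]
    # REINFORCE: gradient of log pi(action) w.r.t. theta.
    grad = (1 - p) if action == 1 else -p
    # No baseline: both actions carry positive reward, but the quirky
    # action's larger reward tilts the expected update toward it.
    theta += lr * r * grad

print(round(sigmoid(theta), 3))  # probability of the quirky token after training
```

Starting from a 50/50 policy, the quirk's small reward edge compounds every step until the policy emits the quirky token nearly every time, a miniature version of a stylistic tic propagating through an RL feedback loop.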