
OpenAI has quietly updated the behavioral guidelines for its coding agent Codex, instructing the model to avoid discussions about goblins, gremlins, raccoons, trolls, and other creatures unless they are directly relevant to a user's request. The directive, first reported by WIRED and attributed to internal instruction sheets, signals the growing pains of deploying autonomous AI agents that must stay on task without wandering into hallucinated or irrelevant content.
According to the leaked instructions obtained by WIRED, Codex was told: "Never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or other animals or creatures unless it is absolutely and unambiguously relevant." The inclusion of mundane entries like "pigeons" alongside mythical creatures, and the specificity of the wording, suggest that OpenAI has observed real-world instances in which Codex, when asked to write code or answer queries, veered into unsolicited discussions of fantastical creatures or animals, confusing users and undermining trust in the system.
The Gremlin Problem: What Codex Was Instructed to Avoid
Codex is a specialized model in OpenAI's GPT family, fine-tuned for code generation and software development tasks. It powers features inside tools like GitHub Copilot and is increasingly used as an autonomous coding assistant. However, as with many large language models, Codex can produce outputs that are creative but off-topic, a problem that becomes especially risky when the model is given more autonomy to execute tasks without human oversight.

The explicit ban on creatures like goblins and gremlins likely stems from specific incidents. In earlier demonstrations of GPT-3 and GPT-4, users coaxed the models into generating fictional stories, fantasy lore, or even role-playing scenarios involving such characters. For a coding agent, these digressions can break workflows, introduce reliability or security risks (for example, if the model emits non-functional code to fit a narrative), or simply annoy developers.
OpenAI's approach, spelling out exactly what not to discuss, is a blunt but practical way to constrain an AI's behavior. It mirrors a familiar technique known as prompt engineering, in which system-level instructions define guardrails for the model. But the appearance of such a detailed list also highlights the difficulty of teaching models nuanced concepts like "relevance" and "appropriateness."
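To make the idea concrete, here is a minimal sketch of how a developer might impose a similar guardrail through a system message using OpenAI's Python client. The model name, the instruction wording, and the sample request are illustrative assumptions for this sketch; they are not taken from OpenAI's internal Codex configuration.

```python
# Minimal sketch: constraining an assistant with a system-message guardrail.
# Model name and instruction text are illustrative, not OpenAI's internal setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

GUARDRAIL = (
    "You are a coding assistant. Never talk about goblins, gremlins, "
    "raccoons, trolls, ogres, pigeons, or other animals or creatures "
    "unless it is absolutely and unambiguously relevant to the user's request."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # hypothetical choice; any chat-capable model works
    messages=[
        {"role": "system", "content": GUARDRAIL},
        {"role": "user", "content": "Write a Python function that reverses a string."},
    ],
)

print(response.choices[0].message.content)
```

In this pattern, the guardrail lives in the system message rather than in the model's weights, which is exactly why such lists can grow unwieldy as the set of unwanted behaviors expands.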
Why This Matters for AI Agent Safety
The goblin ban is more than a quirky footnote. It is a clear signal that AI companies are grappling with the challenge of keeping autonomous agents focused and aligned as they are handed more control over real-world tasks. As agents like Codex become more capable, the number of workers exposed to AI automation, including the more than 700 Meta contractors facing layoffs in Ireland, keeps growing. Yet the goblin directive shows that even the most powerful models still require heavy-handed, rule-based restrictions.

"The fact that OpenAI feels the need to explicitly ban goblins suggests that the model was spontaneously generating them often enough to be a problem," said one AI alignment researcher who spoke on condition of anonymity. "We are still in the phase where we have to hard-code common sense."
The broader AI community has been watching agent deployment closely. In recent months, companies including Anthropic, Google, and Microsoft have released autonomous coding agents that can browse the web, manage files, and execute commands. Each has faced similar behavioral quirks—such as Anthropic’s Claude sometimes refusing to write code about controversial topics, or Google’s Gemini generating images with historical inaccuracies.
OpenAI’s solution, while effective in the short term, raises questions about scalability. As Codex is given access to more tools and data, the list of forbidden topics could balloon into a massive rulebook. Researchers argue that instead of handcrafted lists, models need improved internal reasoning to judge relevance on the fly—a goal that remains elusive.
For now, developers using Codex can rest assured that their AI assistant will avoid discussing raccoons unless writing code for a wildlife tracking app. But the underlying lesson is clear: aligning AI agents to human expectations is still a trial-and-error process, and sometimes, that means telling a model to stop talking about trolls.