The Ghost in ChatGPT-4o's Image Generator: What the Prompts Don't Tell You
How I uncovered the hidden information that makes AI image replication possible—and why it matters for the future of generative AI
In the rapidly evolving world of generative AI, what you see isn't always what you get—or more accurately, what you're told isn't always everything that's happening behind the scenes. This became startlingly clear during my recent experiment with OpenAI's latest multimodal model, ChatGPT-4o.
The Curious Case of the Puppy Photo
Like many in the AI community, I've been exploring ChatGPT-4o's impressive image generation capabilities since its release. What sets this model apart is its ability to "see" images you share and generate new ones based on them—often with remarkable fidelity. But I wanted to understand exactly how this works.
My experiment was simple: I uploaded a photo of a puppy to ChatGPT-4o and asked it to recreate the image. The result was strikingly similar, capturing not just the breed and color but specific details like the head tilt, head size, carpet pattern, and a lock in the background. So far, so impressive.
Create image date 2025-04-05. recreate this image via ai without simply echoing the image... sign the bottom left as evidence
The Source Image:
The recreated image:
Then things got more interesting. I asked ChatGPT-4o to reveal the prompt it had used to generate this near-perfect copy. The AI obliged, providing what appeared to be a comprehensive text description used to create the image.
give me the text prompt that will generate the exact same image if pasted in a fresh chat with no input image
Here’s the exact text prompt you can use to regenerate the same image in a fresh chat:
A photograph taken on April 5, 2025, features a young black and tan mixed-breed puppy sitting on a vintage-style patterned area rug. The puppy has wavy, slightly curly fur with distinct tan markings on its eyebrows, muzzle, chest, and legs. It sits calmly, facing forward with large brown eyes and a shiny black nose. Behind the puppy is an old wooden chest with visible grain texture and a metal keyhole, adding a rustic charm to the scene. The bottom left corner of the image is signed with the date "2025-04-05" as evidence. The lighting is warm and natural.
What's Missing from the Prompt
Armed with this prompt text, I asked some LLM London buddies (so no chat history or context could leak in) to generate an image using only the provided text description—no reference photo this time.
The result? A recognizable and very good puppy match, yes, but with notable differences:
The original puppy's distinctive head tilt was gone
The body proportions changed (my new puppy was noticeably plumper)
The background details were wrong: the lock was on the wrong side and had become rectangular
The carpet pattern was different
Some fundamental features that clearly existed in ChatGPT-4o's "mental model" of my original photo never appeared in the disclosed text prompt it claimed to use.
The Invisible Context
This experiment reveals something crucial about current image generation models: the text prompts we see are only part of the story. When ChatGPT-4o views your image, it creates a rich internal representation—what AI researchers might call a latent representation or embedding—that captures details far beyond what makes it into the explicit text prompt.
This internal representation serves as an invisible guide that shapes the output image, even though users never see it. It's like a ghost in the machine—present and influential but imperceptible to the end user.
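To make the idea concrete, here is a toy sketch of the gap between a text prompt and a latent representation. This is not OpenAI's actual pipeline—the encoder below is just a fixed random projection standing in for a real vision model—but it shows how an embedding of an image carries far more raw information than a short disclosed prompt:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: a small 64x64 RGB "photo" and a short text prompt.
image = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
prompt = "A young black and tan puppy sitting on a patterned rug."

# A toy "vision encoder": a fixed random projection from pixels to a
# 256-dimensional latent vector, standing in for a real embedding model.
flat = image.astype(np.float32).ravel() / 255.0
projection = rng.normal(size=(256, flat.size)).astype(np.float32)
latent = projection @ flat

# Compare how much raw information each representation carries.
prompt_bytes = len(prompt.encode("utf-8"))
latent_bytes = latent.nbytes  # 256 float32 values = 1024 bytes

print(f"text prompt:      {prompt_bytes} bytes")
print(f"latent embedding: {latent_bytes} bytes")
```

Even in this toy, the latent vector dwarfs the prompt in raw bytes—and a real model's internal representation is richer still, which is exactly why regenerating from the text prompt alone loses details like the head tilt.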
Why This Matters
This finding has two important implications:
Prompt Disclosure Limitations: Some information simply can't be (or isn't being) disclosed.
Technical Transparency: OpenAI isn't being deceptive—this is simply how multimodal models work—but there's a gap between what users understand about the process and what's actually happening.
Beyond Text Prompts
The future of generative models isn't just better text prompts—it's systems that can seamlessly blend understanding across different types of information: visual, textual, and eventually other modalities.
For creators and developers working with these tools, understanding these hidden mechanisms is crucial. The prompt is no longer the whole story, and our strategies for working with these models need to evolve accordingly.
A Hint At Another Application: Image Recompression
Both generated images were decent matches to the source, depending on how picky you are. But hang on a minute: if we can make a decent image from a few words, could this be the future of image compression? The disclosed text prompt is 595 raw bytes; the source image is 270,000 bytes. If you're happy with minor details like the lock being on the wrong side (I am) and a few other squiffy bits, then 270,000 bytes becoming 595 is a lot of compression. Even for the much more accurate hidden-information picture, my hunch is that something like 2,700 raw bytes would be enough to create a closer match. If that hunch is even vaguely accurate, this is still great "compression."
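The back-of-envelope arithmetic works out like this (the 2,700-byte figure is my guess, not a measured quantity):

```python
# Back-of-envelope numbers from the experiment above.
source_image_bytes = 270_000   # the uploaded puppy photo
disclosed_prompt_bytes = 595   # the text prompt ChatGPT-4o disclosed
hunch_latent_bytes = 2_700     # a guess at the hidden representation's size

prompt_ratio = source_image_bytes / disclosed_prompt_bytes
latent_ratio = source_image_bytes / hunch_latent_bytes

print(f"prompt-only 'compression':            {prompt_ratio:.0f}x")
print(f"hidden-representation 'compression':  {latent_ratio:.0f}x")
```

Roughly a 454x reduction for the prompt alone, and still about 100x even under the more generous hidden-representation guess. Lossy, certainly—but an intriguing ratio.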
What's Next?
As someone who's been tracking AI development closely, I believe we're witnessing the early days of truly multimodal generative AI. The text prompt interface we're familiar with today is likely just a transitional technology—a human-readable approximation of much richer processes happening inside these systems.
In my next post, I'll explore more capabilities, both hidden and disclosed, of the latest image generation tools.
If you found this analysis helpful, consider subscribing for more insights on the cutting edge of AI technology. I dig into the technical realities behind the headlines and hype to help you understand what's really happening in generative AI.
Have you noticed similar phenomena with ChatGPT-4o or other models? Share your experiences in the comments below.