A simple prompt: to ChatGPT 4o:
“create sn image of a man in his 40s looking at the sunrise, and the moon is also visible somehow”
Response:

Me: “Are there teo moons?”
AI: “No, there’s only one moon visible in the image, positioned alongside the sunrise to capture both the night and day atmosphere. If you’d like an image with two moons or any other adjustments, I can create a new version for you.”
Me: “what is 2 and 3? i see teo different objects”

AI: “
Thank you for clarifying with the image. It does indeed look like there are two different celestial objects in the sky, marked as 2 and 3 in your annotations.
• Object 2: This large object resembles a moon but is oversized compared to a realistic moon. It could be interpreted as a stylized or artistic representation of the moon.
• Object 3: This smaller object has the appearance of a second moon or another celestial body, giving the scene an otherworldly feel.
The image unintentionally includes two different moon-like objects, which may add a surreal or artistic element to the scene. If you’d prefer a more realistic sunrise with only one moon, I can create a new version for you.”
Me: “yes please 😎”
AI: “Here’s the revised image with a single moon visible along with the sunrise. Let me know if this captures what you envisioned or if you’d like any further adjustments.”

Me: “same issue again 😎”

AI:

etcetcetc 😂
Me: “Let’s take a meta-analysis here. You are an expert in multi-modal LLM’s. What could the root cause be here?”
AI: “
Great approach! Let’s break down the issue and analyze the root cause by considering how a multi-modal language model like DALL-E processes prompts and generates images.
Possible Root Causes of Repeated Misinterpretation
1. Semantic Ambiguity in Language-to-Image Translation:
• Language models, even multimodal ones, rely on probabilities to interpret prompts. Words like “moon” and “sun” are contextually close in some artistic interpretations (sunrise, celestial scenes). The model might be generating multiple celestial objects as it interprets “moon and sun” as an artistic or fantasy scene rather than a realistic one.
• In an effort to provide “balance” or “composition,” DALL-E might assume that “sunrise” with “moon visible” implies a surreal, otherworldly quality, which can lead to extra moons or planets for visual impact.
2. Data Bias from Training Set:
• The model was likely trained on a vast array of images from different styles, including landscapes with multiple moons or celestial objects (common in sci-fi or fantasy art). If the model encounters keywords like “sunrise” and “moon,” it may retrieve patterns from similar images in its training data where multiple moons or celestial elements are common.
3. Inherent Artistic Interpretation Tendency:
• Multi-modal LLMs like DALL-E are designed to interpret language creatively. In scenes involving the moon and sun, there’s often a creative bias to add extra elements to achieve “artistic depth.” The model might be generating extra objects because it “thinks” they add to the aesthetic appeal or storytelling aspect of the image.
• Additionally, without explicit instructions to limit to “realistic astronomy,” the model may opt for stylistic flourishes.
4. Difficulty in Reinforcing Negative Constraints:
• Language models often struggle with interpreting negative constraints (e.g., “only one moon” or “no extra objects”). While we specify “only one moon,” DALL-E might still interpret that as “prioritize one, but other celestial bodies are acceptable.” Models generally focus more on adding elements rather than removing or limiting them.
5. Positional Ambiguity in Composition:
• Models may struggle with interpreting where in the scene elements should appear relative to each other, especially when working with abstract ideas like “only one celestial body” in an expansive sky. The model might find it challenging to differentiate whether a user wants a single realistic scene or a creatively expansive one with multiple objects.
Possible Solutions
To address these root causes, here are a few strategies we could attempt:
1. Explicit Negative Prompting:
• Try using very explicit negative prompts like “only one moon and one sun in the sky; do not add additional planets, moons, or celestial objects.”
2. Simplifying the Scene:
• Reducing the complexity of the scene might help, for example, by focusing only on the sunrise with a single moon and removing additional context like the man’s expression or landscape, as these might inadvertently prompt the model to “fill in” with extra elements.
3. Using Post-Editing Tools:
• If DALL-E consistently interprets the scene incorrectly, editing the final image manually or with an image editing tool might be the most direct solution. Tools like Photoshop or GIMP can help remove extra moons or celestial objects.
4. Experimenting with Alternative Models:
• If available, using other image-generation models might provide a different interpretation. Some models, especially those trained on realistic images rather than art or fantasy, may adhere better to realistic prompts.
5. Iterative Prompt Refinement:
• Continue refining the prompt iteratively, perhaps by trying more minimal descriptions (e.g., “a single moon at sunrise in the sky”) and analyzing each output until the model produces an image closer to the desired outcome.
Let me know if you’d like to try any of these methods, or if there’s another approach you’d like to explore!”
Even though I did not get the image I wanted, I still had fun here!