Quick case study of Dall-e 2 and reproducing the real world

Daryl Autar
5 min read · May 22, 2022
AI generated image of Rotterdam by Dall-e 2 from OpenAI
Ceci n’est pas Rotterdam

Most people will scroll past quickly, assume the above is a photo of Rotterdam and carry on with their day. But you’re not most people :) Look closer and you’ll notice things are off. The ‘photo’ was generated by Dall-e 2: the text-to-image generator that has everyone in AI hyped up and everyone in creative roles fearing for their jobs.

The text prompt given to Dall-e 2 to generate photos was: “Tilt shift aerial view of Rotterdam city during summer anno 2020”

In this writeup, I’ll go over the elements of the text prompt one by one and evaluate how well Dall-e 2 was able to deliver on each.

Overall impression of how well it was able to draw Rotterdam from memory

Reference image of Rotterdam in 2015 from a roughly similar angle, cropped from source: https://commons.wikimedia.org/wiki/File:Rotterdam_Blick_vom_Euromast_auf_Kop_van_Zuid_1.jpg

What it got right

I have to say, I am quite impressed that it was able to generate the 2 main landmarks: the Erasmus bridge (aka the Swan) and the ‘de Rotterdam’ tower (the oddly shaped geometric building). It also gets the positioning right: the bridge spans the water, and the tower stands on the other side of the river.

What it got wrong

It knows the overall shape of buildings, but it’s quite bad at the details, such as windows. Also, most other buildings/towers are just randomly placed here and there and given random textures. It treats the Erasmus bridge as an independent structure that does not necessarily need to be connected to roads. That is an interesting result, because the roads in the photo suggest that Dall-e 2 does understand the concept of roads leading somewhere. It just discontinues them as soon as they are occluded by other objects. Object permanence seems to be a bit of a challenge.

Breakdown of the other text prompt elements

Tilt shift: I wanted to compare it with a tilt shift filter from photo editing apps like Instagram:

Instagram tilt shift edited reference image of Rotterdam from a roughly similar angle, cropped from source: https://commons.wikimedia.org/wiki/File:Rotterdam_Blick_vom_Euromast_auf_Kop_van_Zuid_1.jpg

I don’t see a clear winner between Dall-e 2 and Instagram, both look just meh. In the Instagram version, it has to do with the sharpness of the input image and the perspective. I’ve seen the Instagram filter produce way better results. Then again, here’s a better result from Dall-e 2 on a different tilt shift prompt as well.

Aerial view: I expected it to be zoomed out more, like so:

Aerial view of Windhoek, Namibia, Brian McMorrow, CC BY-SA 2.5, via Wikimedia Commons https://commons.wikimedia.org/wiki/File:Windhoek_aerial.jpg

The above is a typical aerial view photo perspective. It’s clearly taken from high up in the sky. Dall-e 2 only generated a view from closer to the ground, which is not what we usually mean by the term ‘aerial view’. One explanation could be that Dall-e 2 hasn’t seen many aerial view photos in training. Another, more likely explanation is that it has seen those types of photos, but does not associate them with the label ‘aerial view’.

Summer: The generated photos look sunny indeed. Take a look at the generated photo below. Buildings even have darker parts consistently on one side, indicating shade cast by a single light source. There are even reflections generated in the water for the larger structures!

Dall-e 2 generated result of Rotterdam #2

Anno 2020: I chose 2020 because at that time the Zalmhaven tower (a 215-meter skyscraper) was being built. I wanted to judge whether the model can stay consistent with a specific point in time. We could verify this if the photos showed the Zalmhaven tower under construction. (If the building is completely missing: 2017 or earlier. If the building is shown fully completed: 2022.) Alas, it’s hard to judge whether the tower is missing entirely, or whether it’s present but misplaced and given a different texture. Although this new tower is quite an icon as the tallest building in the Netherlands, the AI behind Dall-e 2 is likely biased by the training data. Because the Zalmhaven tower is so recent, far fewer pictures of it are available, so the model does not prioritize its shape, color and position. (Unlike the Erasmus bridge and the ‘de Rotterdam’, which it rendered very well.)
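For the programmers among you, the dating rule above can be written down as a tiny lookup. This is only a sketch of my own reasoning, not anything Dall-e 2 does: the names `TowerState`, `implied_period` and `matches_prompt_year` are made up for illustration, and a human still has to look at the image and decide which state the tower is in.

```python
from enum import Enum


class TowerState(Enum):
    """How the Zalmhaven tower appears in an image (hypothetical labels)."""
    MISSING = "missing"
    UNDER_CONSTRUCTION = "under construction"
    COMPLETED = "completed"


def implied_period(state: TowerState) -> str:
    """Map the tower's appearance to the period it implies, per the rule above."""
    if state is TowerState.MISSING:
        return "2017 or earlier"
    if state is TowerState.COMPLETED:
        return "2022"
    # Anything in between means the tower was still being built.
    return "during construction (consistent with 2020)"


def matches_prompt_year(state: TowerState) -> bool:
    """True only when the image is consistent with the 'anno 2020' prompt."""
    return state is TowerState.UNDER_CONSTRUCTION


if __name__ == "__main__":
    for state in TowerState:
        print(state.value, "->", implied_period(state), "|", matches_prompt_year(state))
```

In practice the hard part is the step this sketch skips: deciding by eye whether the tower is missing or just misplaced with a different texture.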

Diversity of results

All 4 results generated by Dall-e 2, https://twitter.com/bakztfuture/status/1525285497626628097

I usually see very diverse results for any single Dall-e 2 prompt. In this case, all 4 results seem to be taken from the same point of view and the same part of town. I expected it to return some other iconic views and landmarks as well, like Rotterdam’s Euromast tower or city center.

Final thoughts

The first version of Dall-e went viral in 2021, but was not accessible for us to try ourselves (thanks ‘Open’AI). Now with Dall-e 2 they have only given access to a few lucky people, so I was very skeptical when I saw the mind-blowing results. For many of the prompts making the rounds on Twitter it’s impossible to say whether the results were made by a human artist or by Dall-e 2. *conspiracy voice* What if they just hired a floor full of illustrators to keep the hype going?

Seeing Dall-e 2 struggle with my text prompt removes any doubt that the results are generated by an algorithm. The above photos have obvious flaws. They can surely be considered artwork, but if the job is to represent the city accurately, then Dall-e 2 did a poor job. At the same time, I’ve spent plenty of time in Rotterdam, yet if anyone asked me to draw the city from memory, I’m sure I would not outperform Dall-e 2. In fact, my strategy would also be to spend a lot of time on the few obvious landmarks and just fill in the rest with random stuff.

Dall-e 2 is nothing short of mind-blowingly impressive for text prompts asking to generate a few objects. It does very well when it does not have to produce results that are accurate to the real world. I do see room for improvement w.r.t. labeling the training data to get even better results when generating photos containing many objects. In the meantime, creatives needn’t hang up their pencils yet.

Huge shoutout to Bakz T. Future, who is one of the lucky few with access to Dall-e 2 and let me submit this prompt.

Daryl Autar

Founder of Imagine AI and Co-Founder of Wavy Assistant. Former AI lead and consultant, now leverages AI for positive social impact. Wins hackathons, a lot.