With GPT-3, OpenAI showed that a single deep learning model can be trained to complete or even compose text realistically, simply by feeding the system a gigantic mass of text as training data. It then became clear that the same approach also works when text is replaced by pixels: an AI could be trained to complete half-finished images. GPT-3 mimics how humans use language; Image GPT-3 predicts what we see.
OpenAI has now combined these two ideas and developed two new models, called DALL · E and CLIP, each of which combines language and images in a way that should help AI better understand what words mean and what they refer to. “We live in a visual world,” says Ilya Sutskever, chief scientist at OpenAI. “In the long run, you will have models that understand both text and images. AI will be able to understand language better because it can see what words and sentences mean.”
GPT-3 still doesn’t know what it’s doing
For all of GPT-3’s charm, what comes out of the system can still feel detached from reality, as if it doesn’t know what it is actually talking about. No wonder: it doesn’t. By combining text with images, researchers at OpenAI and elsewhere are trying to give language models a better grasp of the everyday concepts people use to make sense of things.
DALL · E and CLIP approach the problem from different directions. At first glance, CLIP (short for “Contrastive Language-Image Pre-training”) is just another image recognition system.
There is more to it, however: the system has learned to recognize images not from labeled examples in a data set curated by humans, as most existing models do, but from images and their captions taken from the Internet. It learns what can be seen in a picture from a full description rather than from a single term such as “cat” or “banana”.
Easier to generalize
CLIP is trained to predict which description, out of a random selection of 32,768, is the correct one for a given image. To do this, CLIP learns to link a wide variety of objects with their names and the words that describe them. It can then identify objects in images that are not part of the training set.
Most image recognition systems are trained to identify certain types of objects, such as faces in surveillance videos or buildings in satellite images. Like GPT-3, CLIP can generalize across tasks without any additional training.
In addition, CLIP is less likely than other modern image recognition models to be misled by adversarial examples: images that have been altered only slightly, in ways that typically confuse algorithms even though a person might not notice any difference.
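The training objective described above, picking the right description for an image out of a large random batch, can be illustrated with a minimal sketch. It is not OpenAI's implementation: the image and text encoders are left out entirely, the embeddings are small hand-made vectors, and the function names and the `temperature` value are assumptions chosen for illustration. Only the image-to-description direction mentioned in the article is shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def caption_probabilities(image_vec, caption_vecs, temperature=0.1):
    """Softmax over the similarities between one image and every caption."""
    logits = [cosine(image_vec, c) / temperature for c in caption_vecs]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_loss(image_vecs, caption_vecs, temperature=0.1):
    """Mean cross-entropy of choosing caption i for image i within the batch."""
    losses = []
    for i, image_vec in enumerate(image_vecs):
        probs = caption_probabilities(image_vec, caption_vecs, temperature)
        losses.append(-math.log(probs[i]))
    return sum(losses) / len(losses)


# Hypothetical toy embeddings: image i belongs with caption i in `aligned`.
images = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
aligned = [[0.9, 0.1], [0.1, 0.9], [0.6, 0.8]]
shuffled = [aligned[1], aligned[2], aligned[0]]  # deliberately mismatched pairs

loss_aligned = contrastive_loss(images, aligned)
loss_shuffled = contrastive_loss(images, shuffled)
print(loss_aligned < loss_shuffled)
```

In this toy setup the loss is lower when each image sits next to its own description than when the pairs are mismatched. Training would adjust the encoders to push the loss down; in the real system the batch holds 32,768 pairs rather than three.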
Surreal results
DALL · E (probably a pun on the film title “WALL · E” and Dalí), by contrast, does not recognize images; it draws them. The model is a smaller version of GPT-3 and was also trained on text-image pairs taken from the Internet. Given a short description in natural language, such as “a painting of a capybara sitting in a field at sunrise” or “a cross-section view of a walnut”, DALL · E generates lots of images intended to match it: dozens of capybaras of all shapes and sizes in front of orange or yellow backgrounds, and rows of walnuts (though not all of them in cross section).
The results are fascinating, but still hit and miss. The description “a stained-glass window with an image of a blue strawberry” produces many accurate results, but also some with blue windows and red strawberries. Others contain nothing that resembles a window or a strawberry. The results recently published by OpenAI were not cherry-picked by hand, however, but ranked by CLIP.
For each description, CLIP selected the 32 DALL · E images that it judged to match the caption best. “Text-to-image is a research challenge that has been around for a long time,” says Mark Riedl, who works on natural language processing (NLP) and computational creativity at the Georgia Institute of Technology in Atlanta. “But this is a pretty impressive set of examples.”
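The ranking step mentioned above, in which one model scores the images produced by another and only the best-matching ones are kept, can be sketched as follows. This is a simplified illustration under stated assumptions, not the published pipeline: the embeddings are invented three-dimensional vectors, cosine similarity stands in for CLIP's match score, and `rerank` and the candidate names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rerank(caption_vec, candidates, keep=32):
    """Return the names of the `keep` candidates most similar to the caption.

    `candidates` is a list of (name, embedding) pairs.
    """
    ranked = sorted(
        candidates,
        key=lambda item: cosine(caption_vec, item[1]),
        reverse=True,  # highest similarity first
    )
    return [name for name, _ in ranked[:keep]]


# Hypothetical embeddings for one caption and three generated images.
caption = [0.2, 0.9, 0.4]
candidates = [
    ("red_strawberry_blue_window", [0.9, 0.3, 0.1]),
    ("unrelated_image", [-0.5, 0.1, -0.8]),
    ("blue_strawberry_window", [0.2, 0.8, 0.5]),
]

best = rerank(caption, candidates, keep=2)
print(best)
```

With `keep=32` and a larger pool of generated images, the same function would return a shortlist like the one described in the article; how large that pool was is not stated here.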
Radish in a tutu
When the researchers wanted to test DALL · E’s ability to grasp new concepts, they gave the system descriptions of objects it had probably never seen before, such as “an avocado armchair” and “an illustration of a baby radish in a tutu walking a dog”. In both cases, the AI generated images that combined these concepts in a plausible way.
The armchairs in particular all look like both chairs and avocados. “What surprised me most is that the model can take two independent concepts and combine them in a way that is ultimately functional in some way,” says Aditya Ramesh, who co-developed DALL · E. This is probably because half an avocado resembles an armchair with a high backrest, with the pit as a cushion. For other descriptions, such as “snail made out of a harp”, the results are less good, with images that combine snail and harp in strange ways.
DALL · E is the kind of system that NLP expert Riedl had in mind for the Lovelace 2.0 test, a thought experiment he developed in 2014.
The test is intended to replace the Turing test as a benchmark for measuring an AI’s capabilities. It assumes that intelligence is also marked by the ability to bring concepts together in creative ways. Riedl believes that asking a computer to draw a picture of a man holding a penguin is a better test of intelligence than asking whether a chatbot can pass as human, as in the Turing test, because the former is more open-ended and much harder to cheat. “The real test is to find out how far the AI can be pushed outside its comfort zone,” says Riedl.
Recall or create?
“The ability of the model to generate synthetic images from rather strange concepts seems very exciting to me,” says Ani Kembhavi from the Allen Institute for Artificial Intelligence (AI2), who has also developed a system that generates images from text. “The results seem to obey the desired semantics, which I find pretty impressive.” Jaemin Cho, a colleague of Kembhavi, is also impressed: “Existing text-to-image generators have not shown this level of control in drawing multiple objects, or the spatial reasoning abilities of DALL · E,” he says.
But DALL · E also has its limits: if a description contains too many objects, the model loses track of what it is supposed to draw. And rephrasing a description with words that mean the same thing sometimes produces different results. There are also signs that DALL · E is imitating images found online rather than generating new ones. “I am a little suspicious of the radish example, as stylistically it looks like the kind of art that comes from the Internet,” says Riedl. He notes that a quick search turns up a number of cartoons showing such anthropomorphic radishes. “GPT-3, on which DALL · E is based, is notorious for memorizing.”
Nonetheless, most AI researchers agree that grounding language processing in visual understanding, as CLIP and DALL · E do, is a good way to make systems smarter. “The future will consist of such systems,” says OpenAI chief scientist Sutskever. “And both of these models are a step towards such a system.”
(bsc)