Published on November 16th, 2020 | by Emergent Enterprise0
This Could Lead to the Next Big Breakthrough in Common Sense AI
There are many new technologies that are changing the way the world does business (and everything else) and they become even more powerful when they are combined to achieve unprecedented outcomes. We last told you about the new language generation AI, GPT-3 in an August 6, 2020 post and today’s post reports on the mashup of GPT-3 and computer vision. Karen Hao shares this story at the MIT Technology Review about how GPT-3 and computer vision are combining to make AI smarter and more accurate. You don’t have to go far to experience AI making incorrect judgements such as interacting with an online bot so better performance will be welcome. Now AI not only “reads” content but it “sees” things and that makes it more intelligent all the time.
Photo credit: MS TECH | PEXELS
Researchers are teaching giant language models how to “see” to help them understand the world.
You’ve probably heard us say this countless times: GPT-3, the gargantuan AI that spews uncannily human-like language, is a marvel. It’s also largely a mirage. You can tell with a simple trick: Ask it the color of sheep, and it will suggest “black” as often as “white”—reflecting the phrase “black sheep” in our vernacular.
That’s the problem with language models: because they’re only trained on text, they lack common sense. Now researchers from the University of North Carolina, Chapel Hill, have designed a new technique to change that. They call it “vokenization,” and it gives language models like GPT-3 the ability to “see.”
It’s not the first time people have sought to combine language models with computer vision. This is actually a rapidly growing area of AI research. The idea is that both types of AI have different strengths. Language models like GPT-3 are trained through unsupervised learning, which requires no manual data labeling, making them easy to scale. Image models like object recognition systems, by contrast, learn more directly from reality. In other words, their understanding doesn’t rely on the kind of abstraction of the world that text provides. They can “see” from pictures of sheep that they are in fact white.
AI models that can parse both language and visual input also have very practical uses. If we want to build robotic assistants, for example, they need computer vision to navigate the world and language to communicate about it to humans.
But combining both types of AI is easier said than done. It isn’t as simple as stapling together an existing language model with an existing object recognition system. It requires training a new model from scratch with a data set that includes text and images, otherwise known as a visual-language data set.
The most common approach for curating such a data set is to compile a collection of images with descriptive captions. A picture like the one below, for example, would be captioned “An orange cat sits in the suitcase ready to be packed.” This differs from typical image data sets, which would label the same picture with only one noun, like “cat.” A visual-language data set can therefore teach an AI model not just how to recognize objects but how they relate to and act on one other, using verbs and prepositions.
But you can see why this data curation process would take forever. This is why the visual-language data sets that exist are so puny. A popular text-only data set like English Wikipedia (which indeed includes nearly all the English-language Wikipedia entries) might contain nearly 3 billion words. A visual-language data set like Microsoft Common Objects in Context, or MS COCO, contains only 7 million. It’s simply not enough data to train an AI model for anything useful.
“Vokenization” gets around this problem, using unsupervised learning methods to scale the tiny amount of data in MS COCO to the size of English Wikipedia. The resultant visual-language model outperforms state-of-the-art models in some of the hardest tests used to evaluate AI language comprehension today.