People have an early understanding of the laws of physics. Babies, for example, hold expectations for how objects should move and interact with each other, and will show surprise when objects do something unexpected, such as disappearing in a sleight-of-hand magic trick.
Now MIT researchers have designed a model that demonstrates an understanding of some basic “intuitive physics” of how objects should behave. The model could be used to help build smarter artificial intelligence and, in turn, provide information to help scientists understand infant cognition.
The model, called ADEPT, observes objects moving around a scene and makes predictions about how those objects should behave, based on their underlying physics. While tracking the objects, the model outputs a signal at each video frame that corresponds to a level of “surprise” – the bigger the signal, the greater the surprise. If an object ever dramatically mismatches the model's predictions – by vanishing or teleporting across a scene, for example – its surprise levels will spike.
In response to videos showing objects moving in physically plausible and implausible ways, the model registered levels of surprise that matched levels reported by humans who had watched the same videos.
“By the time infants are 3 months old, they have some notion that objects don't wink in and out of existence, and can't move through each other or teleport,” says lead author Kevin A. Smith, a research scientist in the Department of Brain and Cognitive Sciences (BCS) and a member of the Center for Brains, Minds and Machines (CBMM). “We wanted to capture and formalize that knowledge to build infant cognition into artificial-intelligence agents. We're now getting near human levels in how the model can distinguish basically implausible from plausible scenes.”
Joining Smith on the paper are co-first authors Lingjie Mei, a student in the Department of Electrical Engineering and Computer Science, and BCS research scientist Shunyu Yao; Jiajun Wu PhD '19; CBMM investigator Elizabeth Spelke; Joshua B. Tenenbaum, a professor of computational cognitive science and researcher in CBMM, BCS, and the Computer Science and Artificial Intelligence Laboratory (CSAIL); and CBMM investigator Tomer D. Ullman PhD '15.
ADEPT relies on two modules: an “inverse graphics” module that captures object representations from raw images, and a “physics engine” that predicts the objects' future representations from a distribution of possibilities.
Inverse graphics basically extracts information about objects – such as shape, pose, and velocity – from pixel inputs. This module captures frames of video as images and uses inverse graphics to extract this information about objects in the scene. But it doesn't get bogged down in the details: ADEPT requires only some approximate geometry of each shape to function. That helps the model generalize its predictions to new objects, not just those it was trained on.
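The coarse, shape-agnostic object description the article describes can be pictured as a small record holding only approximate geometry and motion. This is an illustrative sketch, not the authors' actual representation; the field names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class CoarseObject:
    """Approximate object state: no fine shape detail, just a
    position, a rough bounding extent, and a velocity."""
    x: float
    y: float
    extent: float  # rough size of a bounding region, not exact shape
    vx: float
    vy: float

# Two very differently shaped objects (a "truck" and a "duck")
# reduce to the same kind of coarse description.
truck = CoarseObject(x=0.0, y=0.0, extent=2.0, vx=1.0, vy=0.0)
duck = CoarseObject(x=5.0, y=1.0, extent=0.5, vx=-0.5, vy=0.0)
```

Because the representation discards fine shape detail, the downstream physics predictions apply equally to either object.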
“It doesn't matter if an object is a rectangle or a circle, or if it's a truck or a duck. ADEPT just sees there's an object with some position, moving in a certain way, to make predictions,” Smith says. “Similarly, young infants also don't seem to care much about some properties, such as shape, when making physical predictions.”
These coarse object descriptions are fed into a physics engine – software that simulates the behavior of physical systems, such as rigid or fluid bodies, and is commonly used for films, video games, and computer graphics. The researchers' physics engine “pushes the objects forward in time,” Ullman says. This creates a range of predictions, or a “belief distribution,” for what will happen to those objects in the next frame.
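The “push forward and sample” idea can be sketched as follows. The `step` and `belief_distribution` functions and the Gaussian motion noise are illustrative stand-ins for the actual physics engine, which simulates much richer dynamics:

```python
import random

def step(obj, dt=1.0, noise=0.1, rng=random):
    """Advance one coarse object state by one frame, with a little
    noise on the motion (a toy stand-in for a physics simulator)."""
    return {
        "x": obj["x"] + obj["vx"] * dt + rng.gauss(0, noise),
        "y": obj["y"] + obj["vy"] * dt + rng.gauss(0, noise),
        "vx": obj["vx"],
        "vy": obj["vy"],
    }

def belief_distribution(obj, n_samples=100, rng=random):
    """A set of sampled next-frame predictions: the 'beliefs'
    against which the observed frame will be compared."""
    return [step(obj, rng=rng) for _ in range(n_samples)]

obj = {"x": 0.0, "y": 0.0, "vx": 1.0, "vy": 0.0}
beliefs = belief_distribution(obj)
```

Each sample is one plausible next state; together they describe where the model expects the object to be.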
The model then observes the actual next frame. Once again, it captures the object representations, which it then aligns to one of the predicted object representations from its belief distribution. If the object obeyed the laws of physics, there won't be much mismatch between the two representations. On the other hand, if the object did something implausible – say, it vanished from behind a wall – there will be a major mismatch.
ADEPT then resamples from its belief distribution and notes a very low probability that the object had simply vanished. If the probability is low enough, the model registers great “surprise” as a signal spike. Basically, surprise is inversely proportional to the probability of an event occurring: the lower the probability, the higher the spike.
“If an object goes behind a wall, your physics model maintains the belief that the object is still behind the wall. If the wall then drops and nothing is there, there's a mismatch,” Ullman says. “Then the model says, ‘There's an object in my prediction, but I see nothing. The only explanation is that it disappeared, so that's surprising.’”
Violation of expectations
In developmental psychology, researchers run “violation of expectations” tests in which infants are shown pairs of videos. One video shows a plausible event, with objects adhering to expected notions of how the world works. The other is the same in every way, except objects behave in a way that violates expectations in some manner. Researchers will often use these tests to measure how long an infant looks at a scene after an implausible action has occurred. The longer they stare, researchers hypothesize, the more they may be surprised or interested in what just happened.
For their experiments, the researchers created several scenarios based on classic developmental research to examine the model's core object knowledge. They recruited 60 adults to watch 64 videos of known physically plausible and physically implausible scenarios. Objects, for instance, would move behind a wall and, when the wall dropped, they would either still be there or be gone. The participants rated their surprise at various moments on a scale of 0 to 100. Then, the researchers showed the same videos to the model. Specifically, the scenarios probed the model's ability to capture notions of permanence (objects do not appear or disappear for no reason), continuity (objects move along connected trajectories), and solidity (objects cannot move through one another).
ADEPT matched human responses particularly well on videos where objects moved behind walls and disappeared when the wall was removed. Interestingly, the model also matched surprise levels on videos that didn't surprise humans but perhaps should have. For example, in a video where an object moving at a certain speed disappears behind a wall and immediately comes out the other side, the object might have sped up dramatically while behind the wall, or it might have teleported to the far side. In general, humans and ADEPT were both less certain about whether that event was surprising. The researchers also found that traditional neural networks that learn physics from observation – but don't explicitly represent objects – are far less accurate at distinguishing surprising from unsurprising scenes, and their picks for surprising scenes often don't align with human judgments.
The researchers next plan to delve further into how infants observe and learn about the world, with the aim of incorporating any new findings into their model. Studies show, for example, that infants up to a certain age aren't very surprised when objects change completely in some ways – such as when a truck disappears behind a wall but reemerges as a duck.
“We want to see what else needs to be built in to understand the world more like infants do, and formalize what we know about psychology to build better AI agents,” Smith says.