SPS
Length: 1:10:59
Humans are social animals. Perhaps this is why we so enjoy watching movies, TV shows, and YouTube videos, all of which show people in action. A central problem for artificial intelligence, therefore, is to develop techniques for analyzing and understanding human behavior from images and video.
I will present some recent results from our research group towards this grand challenge. We have developed highly accurate techniques for reconstructing 3D meshes of human bodies from single images using transformer neural networks. Given video input, we link these reconstructions over time by 3D tracking, thus producing "Humans in 4D" (3D in space + 1D in time). As a fun application, we can use this capability to transfer the 3D motion of one person to another, e.g., to generate a video of you performing Michael Jackson's moonwalk or Michelle Kwan's skating routine.
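The "Humans in 4D" idea — linking independent per-frame 3D reconstructions into temporal tracks — can be sketched in miniature. The code below is a hypothetical simplification: it represents each reconstructed person by a single 3D point (e.g., a root joint of the mesh) and greedily associates detections across frames by nearest-neighbor distance. Real tracking systems use learned appearance and pose cues, not just proximity; the function name and threshold are illustrative assumptions.

```python
import numpy as np

def link_tracks(frames, max_dist=1.0):
    """Greedily link per-frame 3D detections into tracks.

    frames: list of (N_t, 3) arrays, one 3D position per detected
    person per frame (a toy stand-in for full mesh reconstructions).
    Returns a list of tracks, each a list of (frame_idx, position).
    """
    tracks = []
    active = []  # indices of tracks that matched in the previous frame
    for t, dets in enumerate(frames):
        new_active = []
        used = set()
        # extend each active track with its nearest unused detection
        for ti in active:
            last = tracks[ti][-1][1]
            best, best_d = None, max_dist
            for j, p in enumerate(dets):
                d = float(np.linalg.norm(p - last))
                if j not in used and d < best_d:
                    best, best_d = j, d
            if best is not None:
                used.add(best)
                tracks[ti].append((t, dets[best]))
                new_active.append(ti)
        # unmatched detections start new tracks
        for j, p in enumerate(dets):
            if j not in used:
                tracks.append([(t, p)])
                new_active.append(len(tracks) - 1)
        active = new_active
    return tracks
```

With two well-separated people moving slowly, the greedy association recovers two tracks spanning all frames; the 4D output (track identity plus per-frame 3D state) is exactly what downstream applications like motion transfer consume.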
4D reconstruction of hands provides a source of demonstration data for imitation learning in robotics, and we show examples of reconstructing human-object interactions. In addition to 4D reconstruction, we are also now able to recognize actions by attaching semantic labels such as "standing", "running", or "jumping". However, long-range video understanding, such as the ability to follow characters' activities and understand movie plots over periods of minutes and hours, is still quite a challenge, and even the largest vision-language models struggle on such tasks. There has been substantial progress, but much remains to be done.
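To make "attaching semantic labels" concrete, here is a toy sketch that labels each frame of a 3D track as "standing", "running", or "jumping" from root-joint velocity alone. The thresholds and heuristic are my own illustrative assumptions; the systems described in the talk learn such labels from video features rather than hand-set rules.

```python
import numpy as np

def label_track(positions, fps=30.0):
    """Attach a toy action label to each step of a 3D track.

    positions: (T, 3) array of root-joint positions (x, y, z in
    meters), with z pointing up. Returns T-1 labels, one per
    inter-frame transition. Thresholds are illustrative only.
    """
    vel = np.diff(positions, axis=0) * fps  # per-step velocity, m/s
    labels = []
    for v in vel:
        speed_xy = float(np.linalg.norm(v[:2]))
        if v[2] > 1.0:          # rapid upward motion
            labels.append("jumping")
        elif speed_xy > 2.0:    # fast horizontal motion
            labels.append("running")
        else:
            labels.append("standing")
    return labels
```

Even this crude rule shows the shape of the task: per-frame labels are easy once 4D state is available, while the hard open problem the abstract points to — composing such labels into minutes- or hours-long plot understanding — needs much longer temporal context.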