Imagine for a moment that we are on a safari, watching a giraffe graze. After looking away for a second, we see the animal lower its head and sit down. But what, we wonder, happened in the meantime? Computer scientists from the University of Konstanz's Centre for the Advanced Study of Collective Behaviour have found a way to encode an animal's pose and appearance in order to show the intermediate motions that are statistically likely to have taken place.
One key problem in computer vision is that images are incredibly complex. A giraffe can take on an extremely wide range of poses. On a safari, it is usually no problem to miss part of a motion sequence, but for the study of collective behaviour, this information can be critical. This is where the computer scientists and their new model, "Neural Puppeteer", come in.
Predictive silhouettes based on 3D points
"One idea in computer vision is to describe the very complex space of images by encoding only as few parameters as possible," explains Bastian Goldlücke, professor of computer vision at the University of Konstanz. One representation frequently used until now is the skeleton. In a new paper published in the Proceedings of the 16th Asian Conference on Computer Vision, Bastian Goldlücke and doctoral researchers Urs Waldmann and Simon Giebenhain present a neural network model that makes it possible to represent motion sequences and render the full appearance of animals from any viewpoint based on just a few key points. The 3D view is more malleable and precise than the existing skeleton models.
"The idea was to be able to predict 3D key points and also to be able to track them independently of texture," says doctoral researcher Urs Waldmann. "This is why we built an AI system that predicts silhouette images from any camera perspective based on 3D key points." By reversing the process, it is also possible to determine skeletal points from silhouette images. On the basis of the key points, the AI system is able to calculate the intermediate steps that are statistically likely. Using the individual silhouette can be important: if you work only with skeletal points, you cannot tell whether the animal you are looking at is fairly massive or close to starvation.
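To make the idea of filling in a missed motion from a few 3D key points more concrete, here is a deliberately simplified sketch. It is not the researchers' method: Neural Puppeteer is a trained neural network that estimates statistically likely poses and renders silhouettes, whereas this example merely blends two poses linearly and projects the result with an idealised pinhole camera. The function names, the two-point "neck" and all coordinates are hypothetical and chosen purely for illustration.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]
Point2D = Tuple[float, float]


def interpolate_pose(start: List[Point3D], end: List[Point3D], t: float) -> List[Point3D]:
    """Linearly blend two poses with the same key points (t=0 gives start, t=1 gives end)."""
    if len(start) != len(end):
        raise ValueError("poses must have the same number of key points")
    return [
        tuple(a + t * (b - a) for a, b in zip(p, q))
        for p, q in zip(start, end)
    ]


def project(points: List[Point3D], focal_length: float = 1.0) -> List[Point2D]:
    """Idealised pinhole projection: camera at the origin, looking along +z."""
    projected = []
    for x, y, z in points:
        if z <= 0:
            raise ValueError("point must be in front of the camera")
        projected.append((focal_length * x / z, focal_length * y / z))
    return projected


# Hypothetical two-point "neck" (base, head): upright versus lowered.
standing = [(0.0, 0.0, 4.0), (0.0, 2.0, 4.0)]
lowered = [(0.0, 0.0, 4.0), (1.0, 0.0, 4.0)]

# The pose we "missed" halfway through the motion, then its 2D image coordinates.
halfway = interpolate_pose(standing, lowered, 0.5)
print(halfway)
print(project(halfway))
```

A straight-line blend like this can produce poses no real animal would adopt, such as a neck that shortens mid-motion. That is one reason a learned model of plausible poses is attractive.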
In the field of biology in particular, there are applications for this model: "At the Cluster of Excellence 'Centre for the Advanced Study of Collective Behaviour', we see that many different species of animals are tracked and that poses also need to be predicted in this context", Waldmann says.
Long-term goal: apply the system to as much data as possible on wild animals
The team started by predicting silhouette motions of humans, pigeons, giraffes and cows. Humans are often used as test cases in computer science, Waldmann notes. His colleagues from the Cluster of Excellence work with pigeons. However, their fine claws pose a real challenge. There was good model data for cows, while the giraffe's extremely long neck was a challenge that Waldmann was eager to take on. The team generated silhouettes based on a few key points – from 19 to 33 in all.
Now the computer scientists are ready for the real-world application: in the University of Konstanz's Imaging Hangar, its largest laboratory for the study of collective behaviour, data will be collected on insects and birds in the future. In the Imaging Hangar, it is easier to control environmental aspects such as lighting or background than in the wild. However, the long-term goal is to train the model for as many species of wild animals as possible, in order to gain new insight into the behaviour of animals.
- Paper: Giebenhain, S., Waldmann, U., Johannsen, O., Goldluecke, B. (2023). Neural Puppeteer: Keypoint-Based Neural Rendering of Dynamic Shapes. In: Wang, L., Gall, J., Chin, TJ., Sato, I., Chellappa, R. (eds) Computer Vision – ACCV 2022. ACCV 2022. Lecture Notes in Computer Science, vol 13844. Springer, Cham. https://doi.org/10.1007/978-3-031-26316-3_15
- In the publication entitled "Neural Puppeteer", the researchers present a model for representing motion sequences of individuals based on just a few key points
- Bastian Goldlücke is professor of computer vision at the University of Konstanz
- Urs Waldmann is a doctoral researcher at the Cluster of Excellence "Centre for the Advanced Study of Collective Behaviour"