A robot face mapped its own muscles, then lip-synced from online videos

What would it take to make a humanoid convincingly believable, only to find that a couple of millimeters of silicone at the edge of the mouth were the hardest part?

Researchers at Columbia University's Creative Machines Lab have built a humanoid robotic face capable of strikingly natural lip movement for speech and singing, learned not from a table of phonemes but simply by watching. The system addresses a long-standing weakness in humanoid design: people readily forgive a clumsy gait or clumsy hands, but react sharply to facial movement that looks even slightly off. That sensitivity is one cause of the “uncanny valley” effect seen in otherwise advanced robots.

The project's key step is to treat facial control as a physical skill to be learned rather than an animation problem to be scripted. The robot face has 26 miniature motors covered with soft synthetic skin, giving it enough degrees of freedom to form fine-grained mouth shapes. Rather than starting from a library of mouth poses for each sound, the team began with “self-exploration”: the robot was placed in front of a mirror, cycled through a thousand random facial expressions, and learned how its internal motor commands relate to the resulting external geometry. Only after building that internal map did it study human examples, training on hours of speech and singing from YouTube to learn statistical associations between audio and visible lip motion.
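The self-exploration phase can be illustrated with a toy sketch. The article does not describe the lab's actual model, so nearly everything here is an assumption made for illustration: a simulated face (`make_toy_face`) stands in for the robot and its mirror, the number of tracked lip landmarks (`N_LANDMARKS`) is invented, and a linear least-squares fit stands in for whatever learned model the team used. Only the 26 motors and the idea of trying random expressions come from the text.

```python
import numpy as np

N_MOTORS = 26      # motor count reported in the article
N_LANDMARKS = 20   # hypothetical number of tracked lip points (x and y each)


def make_toy_face(rng):
    """Return a simulated 'mirror view': motor commands -> landmark coordinates.

    Hypothetical stand-in for the real face and camera. A real face is
    nonlinear; this toy one is linear so the example stays short.
    """
    mixing = rng.normal(size=(N_MOTORS, 2 * N_LANDMARKS))
    rest_pose = rng.normal(size=2 * N_LANDMARKS)

    def observe(commands):
        return rest_pose + commands @ mixing

    return observe


def babble(observe, n_samples, rng):
    """Try random motor commands and record what the 'mirror' shows."""
    commands = rng.uniform(-1.0, 1.0, size=(n_samples, N_MOTORS))
    landmarks = np.stack([observe(c) for c in commands])
    return commands, landmarks


def fit_self_model(commands, landmarks):
    """Least-squares fit of landmarks ~ [commands, 1] @ weights."""
    design = np.hstack([commands, np.ones((len(commands), 1))])
    weights, *_ = np.linalg.lstsq(design, landmarks, rcond=None)
    return weights


def predict(weights, commands):
    """Predict landmark coordinates for one or more motor command vectors."""
    commands = np.atleast_2d(commands)
    design = np.hstack([commands, np.ones((len(commands), 1))])
    return design @ weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    observe = make_toy_face(rng)
    commands, landmarks = babble(observe, n_samples=1000, rng=rng)
    weights = fit_self_model(commands, landmarks)

    # Check the learned map on a command the model never saw during babbling.
    test_command = rng.uniform(-1.0, 1.0, size=N_MOTORS)
    error = np.abs(predict(weights, test_command)[0] - observe(test_command)).max()
    print(f"max landmark prediction error: {error:.2e}")
```

Because the toy face is exactly linear, the fit recovers it almost perfectly; soft skin over real motors would presumably need a richer model and many more samples.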

That two-phase pipeline, self-modeling first and imitation second, belongs to a family of methods the group calls vision-to-action models, in which perceptual inputs are converted directly into motor outputs. This matters because speech articulation is not merely a succession of discrete shapes; it is a rapid, continuous dance in which delays are noticeable. Even on unconventional utterances, the face coordinated its lip movement in several languages and even sang on an AI-generated album called Hello World, though it struggled with hard bilabials such as “B” and puckered sounds such as “W”. “The more it interacts with humans, the better it will get,” said Hod Lipson.
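To make the idea of converting perception into motor output concrete, here is a minimal sketch of the final step only: turning a sequence of target lip shapes into motor commands by inverting a self-model. It assumes a linear self-model (`mixing`, `rest_pose`) and takes the target landmark frames as given; in the real system those targets would come from a model trained on audio and video, which the article does not detail. All names and sizes other than the 26 motors are hypothetical.

```python
import numpy as np


def commands_for_targets(mixing, rest_pose, targets, limit=1.0):
    """Find motor commands whose predicted landmarks best match each target frame.

    mixing:    (n_motors, n_coords) linear self-model (assumed, not from the article)
    rest_pose: (n_coords,) landmark positions with all motors at zero
    targets:   (n_frames, n_coords) desired landmark positions per frame
    """
    residual = (targets - rest_pose).T                      # (n_coords, n_frames)
    solution, *_ = np.linalg.lstsq(mixing.T, residual, rcond=None)
    # Keep commands inside the (hypothetical) safe actuator range.
    return np.clip(solution.T, -limit, limit)               # (n_frames, n_motors)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_motors, n_coords, n_frames = 26, 40, 5

    mixing = rng.normal(size=(n_motors, n_coords))
    rest_pose = rng.normal(size=n_coords)

    # Stand-in targets: lip shapes the toy face can actually reach.
    true_commands = rng.uniform(-0.5, 0.5, size=(n_frames, n_motors))
    targets = rest_pose + true_commands @ mixing

    commands = commands_for_targets(mixing, rest_pose, targets)
    reached = rest_pose + commands @ mixing
    print("max landmark error:", np.abs(reached - targets).max())
```

A frame-by-frame solve like this ignores timing. A practical controller would also have to respect motor speed limits and the delays the article mentions.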

The engineering side of the work also implies that realism is currently limited by face hardware as well as by algorithms. Coordinating lip motion requires compliant skin, low-noise actuation, and enough control bandwidth to handle phoneme transitions; without them, even good models produce puppet-like artifacts. That is one reason the field has historically focused on locomotion and manipulation: legs and hands are hard, but their performance is more readily measured against a task. Faces are judged by people, quickly and usually mercilessly.

Human evaluation also offers measurable design levers. In a controlled study that varied a social robot's facial features, 211 participants reported more trust and more positive feelings toward a robot with round eyes and an upturned or neutral mouth. In other words, mechanics and motion are only part of “believability”; appearance sets expectations that motion must then meet.

The larger question is what lip synchronization unlocks in interaction as it improves. Yuhang Hu, who led the study, has argued that combining realistic facial movement with conversational AI could strengthen connection, since facial micro-gestures work in parallel with words. Research on social robots also cautions that people tend to personify them and sometimes form one-sided emotional relationships; such dynamics are discussed in terms of anthropomorphism and virtual-interaction framing for social machines.

The technical milestone, therefore, is not just a robot that lip-syncs to internet video. It is a demonstration that faces can be trained like bodies, through self-calibration, observation, and iteration, and that the social implications of that realism are arriving right along with it.
