Codec Avatars, as Facebook researchers call them, are all but indistinguishable from the humans they portray, and could be a staple of our virtual lives sooner than we think.
"There's this big, ugly sucker at the door," the young woman says, her eyes twinkling, "and he said, 'Who do you think you are, Lena Horne?' I said no but that I knew Miss Horne like a sister."
It's the beginning of a brief soliloquy from Walton Jones' play The 1940's Radio Hour, and as she proceeds with the monologue it's easy to see that the young woman knows what she's doing. As she recounts the doorman's change of tune, her grin widens, like she's letting you in on the joke. Her lips curl as she seizes on just the right words, playing with their cadence. Her movements are so finely tuned, her reading so confident, that with the dark backdrop behind her you'd think you were watching a black-box revival of the late-'70s Broadway play.
For years now, people have been interacting in virtual reality through avatars, computer-generated characters that represent us. Because VR headsets and hand controllers are trackable, our real-life head and hand movements carry into those simulated conversations, the unconscious mannerisms adding vital texture. Yet even as our virtual interactions have become more naturalistic, technological constraints have forced them to remain visually simple. Social VR applications such as Rec Room and Altspace abstract us into caricatures, with expressions that seldom (if ever) map to what we're actually doing with our faces. Facebook's Spaces can create a reasonable cartoon approximation of you from your social media images, but it relies on buttons and thumb-sticks to trigger expressions. Even a more technically demanding platform like High Fidelity, which allows you to upload a scanned 3D model of yourself, is a long way from being able to make an avatar feel like you.
That's why I'm here in Pittsburgh on a ridiculously chilly morning in early March, inside a building that very few outsiders have ever walked into. Yaser Sheikh and his team are finally ready to let me in on what they have been working on since they first leased a small office in the city's East Liberty neighborhood. (They have since moved to a larger space near the Carnegie Mellon campus, with plans to expand again in the next year or two.) Codec Avatars, as Facebook Reality Labs calls them, are the product of a process that uses machine learning to capture, learn, and re-create human social expression. They're also nowhere near being ready for the public. At best they're years away, if they end up being something that Facebook deploys at all. But the FRL team is ready to kick off this discussion. "It'll be big if we can get this finished," Sheikh says with the not-at-all-contained grin of a man who knows that they're going to get it done. "We want to get it out. We want to talk about it."
In the 1927 essay "The Unconscious Patterning of Behavior in Society," anthropologist Edward Sapir wrote that people respond to gestures "in accordance with an elaborate and secret code that is written nowhere, known by none, and understood by all." Ninety-two years later, replicating that elaborate code has become Sheikh's abiding task.
Yaser Sheikh was a Carnegie Mellon professor before he came to Facebook, studying the intersection of computer vision and social perception. Sheikh didn't hesitate to share his own vision when Oculus chief scientist Michael Abrash reached out to him in 2015 to explore where AR and VR could be headed. "The real promise of VR," he says now, both hands around an ever-present coffee mug, "is that instead of flying to meet me in person, you could put on a headset and have this exact conversation that we're having right now—not a cartoon version of you or an ogre version of me, but looking the way you do, moving the way you do, sounding the way you do."
(In his founding document for the facility, Sheikh described it as a "social presence laboratory," a reference to the phenomenon in which your brain reacts to your virtual surroundings and interactions as if they are real. Then again, he also wrote that he figured they could achieve photorealistic avatars within five years, using seven or eight people.)
The principle underlying Codec Avatars is simple and twofold, what Sheikh calls the "ego test" and the "mom test": you should love your avatar, and so should your loved ones. The process enabling the avatars is something far more complicated, as I discovered for myself during two separate capture procedures. The first takes place in a domelike enclosure called Mugsy, the walls and ceiling of which are studded with 132 off-the-shelf Canon lenses and 350 lights focused on a chair. Sitting in the middle feels like being in a black hole made of paparazzi. "I had awkwardly named it Mugshooter," Sheikh admits. "Then we realized it's a horrible, unfriendly name." That was a few versions ago; Mugsy has steadily grown in both cameras and capability, sending early kludges (such as using a ping-pong ball on a string to help participants keep their face in the right position, car-in-the-garage-style) to deserved obsolescence.
In Mugsy, research participants spend about an hour in the chair, making a series of outsize facial expressions and reading lines out loud while an employee in another room coaches them via webcam. Clench your teeth. Relax. Show all your teeth. Relax. Scrunch up your whole face. Relax. "Suck your cheeks in like a fish," technical program manager Danielle Belko tells me. "Puff your cheeks."
If the word panopticon comes to mind, it should, though it would be better applied to the second capture area, a larger dome known internally as the Sociopticon. (Before joining Oculus/Facebook, Sheikh established its predecessor, Panoptic Studio, at Carnegie Mellon.) The Sociopticon looks a lot like Microsoft's Mixed Reality Capture Studio, albeit with more cameras (180 to 106) that are also higher-resolution (2.5K by 4K versus 2K by 2K) and capture at a higher frame rate (90Hz versus 30 or 60). Where Mugsy focuses on your face, the Sociopticon helps the Codec Avatar system learn how our bodies and our clothes move. So my time in there is less about facial expression and more about what I'd call Lazy Calisthenics: shaking out limbs, jumping around, playing charades with Belko via webcam.
The point is to capture as much information as possible (Mugsy and the Sociopticon collect 180 gigabytes every second) so that a neural network can learn to map expressions and movements to sounds and muscle deformations, from every possible angle. The more information it captures, the stronger its "deep appearance model" becomes and the better it can be trained to encode that information as data, and then decode it on the other end, in another person's headset, as an avatar. As anyone who struggled with video encoding problems in the early days of the internet knows, this is where the "codec" in Codec Avatars comes from: coder/decoder.
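For readers who want to see the coder/decoder idea in miniature, here is a rough Python sketch. It is not FRL's method: in place of a neural network it uses the simplest learned codec available, a one-component principal component analysis, and every name in it (`SAMPLES`, `fit_codec`, `encode`, `decode`) and the four-number "expression" frames are invented for illustration. The shape of the process is the same, though: learn from captured examples, squeeze each frame into a compact code on the sender's side, and rebuild it on the receiver's.

```python
import math

# Hypothetical "captured expressions": four made-up measurements per frame
# (say brow, eyelid, cheek, jaw). Real capture data is vastly larger.
SAMPLES = [
    [0.2 + t, 0.5 + 2 * t, 0.1, 0.9 - t]
    for t in (0.0, 0.1, 0.2, 0.3, 0.4)
]


def fit_codec(samples, iterations=50):
    """Learn a mean frame and one dominant direction of variation.

    This is one-component PCA computed by power iteration: a linear
    stand-in for a learned model, not a neural network.
    """
    count, dim = len(samples), len(samples[0])
    mean = [sum(s[i] for s in samples) / count for i in range(dim)]
    centered = [[s[i] - mean[i] for i in range(dim)] for s in samples]
    direction = [1.0] * dim  # arbitrary nonzero starting guess
    for _ in range(iterations):
        # Apply the (unnormalized) covariance matrix to the current guess.
        nxt = [0.0] * dim
        for c in centered:
            proj = sum(ci * di for ci, di in zip(c, direction))
            for i in range(dim):
                nxt[i] += proj * c[i]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:
            break  # no variation along the guess; keep the previous direction
        direction = [v / norm for v in nxt]
    return mean, direction


def encode(frame, mean, direction):
    """Sender side: compress a frame into a single number."""
    return sum((f - m) * d for f, m, d in zip(frame, mean, direction))


def decode(code, mean, direction):
    """Receiver side: rebuild an approximate frame from the code."""
    return [m + code * d for m, d in zip(mean, direction)]


mean, direction = fit_codec(SAMPLES)
code = encode(SAMPLES[3], mean, direction)
restored = decode(code, mean, direction)
error = max(abs(a - b) for a, b in zip(SAMPLES[3], restored))
print(f"1 number sent instead of {len(SAMPLES[3])}; max error {error:.2e}")
```

Because these toy frames vary along a single line, one number is enough to rebuild each of them almost exactly. Real faces vary in far more ways than that, which is presumably why the real system needs a deep model and so much capture data.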