Here’s how to design a robot that can cook

Artificial intelligence can help humans get a lot of things done. AI-powered devices, for example, can follow through on requests to order laundry detergent, vacuum floors, and park cars in tight spots, but they can’t make you dinner… yet.

It turns out that peeling vegetables and distinguishing cumin from coriander are more challenging than understanding language, navigating living room furniture, or measuring the distance between automobiles. The problem is that an AI sous chef in the kitchen would need to understand how to interact with 3D objects the same way that humans do.

Srinath Sridhar, a computer science postdoctoral researcher at Stanford, is working to bring those kinds of human-like abilities to robots. Sridhar recently gave the presentation Deep Learning for Digitizing Human Physical Skills at the San Francisco Deep Learning Meetup.

Sridhar’s work focuses on digitizing human motion and skills from videos. This includes identifying human poses, estimating the shape of objects, and understanding the physics of the everyday world.

“The key characteristic that distinguishes humans from other animals is our ability to interact with diverse environments,” he says. “We might go to the market to buy some vegetables and skillfully manage and manipulate those vegetables. We can also use tools and machinery to build infrastructure, like buildings. We can interact with other humans through gestures or we can operate in our kitchens, and interact with the ingredients and the tools in kitchens to make a recipe.”

Beyond robots in the kitchen
The impact of Sridhar’s work extends beyond robotics to include virtual and augmented reality and prosthetics. To achieve the goal of understanding human interaction with 3D objects, Sridhar is trying to solve three distinct problems.

First, robots must be able to understand human poses to a fine degree, such as where fingers are located and how a hand is articulated. Understanding the environment and what objects are in it is the second major challenge: a robot needs to distinguish a chair from a bookcase, as well as interpret where objects are located and how they are positioned relative to one another. Finally, robots need to emulate how humans interact with objects in various environments.

Sridhar says all three problems are interconnected. “How do we figure out what kinds of interactions humans perform on objects and what are the customs involved in how I grab a coffee mug, for instance?” he says. “We also need an understanding of the physical interactions that humans have with everyday objects, and the key thing here is we need to have a deeper understanding of all three of these components in three dimensions.”

Understanding real-world context
It’s not enough to understand images, says Sridhar; robots need to interpret the 3D world. Fortunately, much of the research into 2D image analysis is relevant to 3D. One way to understand and describe a human pose in an image is to identify the position of joints, typically as 21 sets of two-dimensional coordinates, one per joint. Some researchers have used 2D coordinates from multiple images to build 3D models, but generating those 3D representations takes too long for real-time interaction analysis.
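For intuition only, here is a minimal sketch of that representation (the 21-joint count comes from the description above; the array names and everything else are assumptions for illustration):

```python
import numpy as np

NUM_JOINTS = 21  # one entry per tracked joint, per the 21 coordinate sets above

# A 2D pose: one (x, y) pixel coordinate per joint, e.g. from a keypoint detector.
pose_2d = np.zeros((NUM_JOINTS, 2), dtype=np.float32)

# A 3D pose adds depth, which the methods below try to recover directly
# from a single image instead of triangulating many views.
pose_3d = np.zeros((NUM_JOINTS, 3), dtype=np.float32)
```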

“What we want is something that works with consumer smartphone cameras or webcams,” says Sridhar. “We want something that produces 3D results, and we want something that is interactive, something that is real-time.”

A better approach to identifying the 3D coordinates of joints, he says, is described in a paper titled “VNect: Real-time 3D Human Pose Estimation with a Single RGB Camera.” The idea is to use a deep neural network to produce two kinds of maps from images: a location map and a heat map.

Location maps encode, at every pixel, the 3D coordinates of a joint. A heat map indicates the probability of a joint appearing at a particular position in the image. Reading the location maps at the heat map’s maximum combines the two, yielding a more accurate estimate of a joint’s position than either map alone.
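A minimal sketch of that readout, assuming the network has already produced per-joint heat maps and x/y/z location maps (the function and array names are illustrative, not code from the VNect paper):

```python
import numpy as np

def read_joint_positions(heatmaps, loc_x, loc_y, loc_z):
    """Combine per-joint heat maps with location maps, VNect-style.

    heatmaps, loc_x, loc_y, loc_z: arrays of shape (num_joints, H, W).
    The heat map says where in the image a joint most likely is; the
    location maps store a 3D coordinate at every pixel. Reading the
    location maps at the heat-map maximum gives the 3D joint estimate.
    """
    num_joints, H, W = heatmaps.shape
    joints_3d = np.zeros((num_joints, 3), dtype=np.float32)
    for j in range(num_joints):
        # Pixel with the highest probability for joint j.
        v, u = np.unravel_index(np.argmax(heatmaps[j]), (H, W))
        joints_3d[j] = (loc_x[j, v, u], loc_y[j, v, u], loc_z[j, v, u])
    return joints_3d
```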

Tracking algorithms, which analyze data about figures in videos, follow the position of joints through a series of images. This allows for analysis of the way a human pose changes over the course of a video.
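The talk doesn’t spell out a particular tracking algorithm, but the simplest version of the idea is temporal filtering: carry the previous frame’s estimate forward and blend it with the new one. A minimal exponential-smoothing sketch (the blending weight is an arbitrary choice for illustration):

```python
def smooth_track(per_frame_joints, alpha=0.5):
    """Follow joints across frames by blending each new (num_joints, 3)
    estimate with the running track (simple exponential smoothing)."""
    track = None
    smoothed = []
    for joints in per_frame_joints:
        track = joints if track is None else alpha * joints + (1 - alpha) * track
        smoothed.append(track)
    return smoothed
```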

The second problem, understanding the environment and the objects in it, is tackled with similar techniques, which Sridhar says are described in “GANerated Hands for Real-Time 3D Hand Tracking From Monocular RGB.” The approach combines a convolutional neural network with a kinematic 3D hand model.
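At a high level, combining the two means fitting an articulated (kinematic) hand skeleton so its joints line up with what the network predicts. The sketch below shows only that fitting step under stated assumptions: `hand_forward_kinematics` is a hypothetical placeholder for a real hand model, and the CNN’s 3D keypoint predictions are assumed to already exist.

```python
import numpy as np
from scipy.optimize import least_squares

def hand_forward_kinematics(theta):
    """Hypothetical stand-in for a kinematic 3D hand model: maps joint
    angles `theta` to 3D joint positions of shape (num_joints, 3)."""
    raise NotImplementedError("replace with a real articulated hand model")

def fit_hand_model(predicted_joints_3d, theta_init):
    """Fit the hand model's joint angles so its joints match the
    CNN-predicted 3D keypoints (a least-squares skeleton fit)."""
    def residuals(theta):
        return (hand_forward_kinematics(theta) - predicted_joints_3d).ravel()
    result = least_squares(residuals, theta_init)
    return result.x  # fitted joint angles
```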

The third problem, understanding objects as they relate to humans and other objects, is known as the category-level pose estimation problem. Techniques for solving this problem are outlined in “Normalized Object Coordinate Space for Category-Level 6D Object Pose and Size Estimation.”

Sridhar gives an example that may seem trivial to humans, but presents significant challenges for machine vision: how to identify individual objects in a cluttered scene, such as the picture of a desk covered with papers, books, pens, etc.

One way to solve this final challenge, Sridhar says, is to use models of common objects as references for identifying objects and their orientation in a scene. That approach doesn’t scale, however, given the 10,000 to 20,000 categories of objects that humans recognize.

An alternative technique, called normalized object coordinate space, maps every pixel of a detected object into a shared coordinate frame that is normalized per object category. Combined with other information such as depth, those predictions can distinguish objects, and recover their pose and size, even in cluttered scenes.
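A rough sketch of how those pieces could fit together, assuming a network has already produced normalized-coordinate predictions for one detected object along with the corresponding depth-derived 3D points (the alignment below is a standard similarity-transform fit, not code from the paper):

```python
import numpy as np

def fit_pose_from_nocs(nocs_points, camera_points):
    """Estimate the scale, rotation, and translation mapping predicted
    normalized-object-space points onto observed 3D points from depth.

    nocs_points, camera_points: (N, 3) arrays of corresponding points.
    Returns (scale, R, t) such that camera ≈ scale * R @ nocs + t.
    """
    mu_n = nocs_points.mean(axis=0)
    mu_c = camera_points.mean(axis=0)
    src = nocs_points - mu_n
    dst = camera_points - mu_c

    # Best-fit rotation from the cross-covariance (orthogonal Procrustes via SVD).
    U, S, Vt = np.linalg.svd(dst.T @ src)
    d = np.sign(np.linalg.det(U @ Vt))
    D = np.diag([1.0, 1.0, d])  # guard against a reflection solution
    R = U @ D @ Vt

    # The recovered scale relates the normalized object to its real size,
    # which is how pose *and* size come out of the same fit.
    scale = (S * np.diag(D)).sum() / (src ** 2).sum()
    t = mu_c - scale * R @ mu_n
    return scale, R, t
```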

Sridhar’s work promises to improve the ability of machines to recognize and manipulate 3D objects. While the technology is not quite there yet, it’s getting closer. In the meantime, humans will have to be content to cook their own dinners while robots play music and vacuum the floors.


For more details, watch his presentation on YouTube: “Deep Learning for Digitizing Human Physical Skills.” To find out about upcoming Samsung NEXT and MissingLink events, sign up for our weekly newsletter.
