How MissingLink.ai enables DeepOps to streamline machine learning
Machine learning and deep learning often require huge quantities of data to yield actionable insights. Unfortunately, all that data can overwhelm conventional tools. That results in frustrated data scientists, slower experiments, and delayed results.
There is a better way to implement deep learning, says MissingLink.ai’s Yuval Greenfield. During a presentation at the Open Data Science Conference in Boston, Yuval described DeepOps, a framework for deep learning operations spearheaded by MissingLink.ai, which is part of the Samsung NEXT product team.
DeepOps borrows from the world of DevOps, the well-established methodology for speeding up ordinary code development while ensuring better quality. As Yuval explained, DeepOps for AI development means marshaling the right tools and fostering a community of developers to embrace that solution.
The deep learning bottleneck
After surveying more than 100 AI development companies, Yuval found a common thread: Data scientists spend about a third of their time just building the deep learning infrastructure they need, rather than running deep learning experiments.
“They’re held together by a shoestring that someone just cobbled-up in a few days,” Yuval said about many of the tools data scientists typically use. “Iterations become slower because of it, and eventual time to market is heavily affected,” he added.
All software developers need to run code on machines. For ordinary development, the infrastructure isn’t any more complicated than that, Yuval explained.
Data scientists and developers working on machine learning experiments have to wrangle lots of data and build multiple iterations of their models. They also have to keep track of the results of those experiments, all of which requires a different approach to development than conventional coding.
The hardware challenge
Data scientists typically end up cobbling together mashups of local and cloud-based resources to do their work, Yuval explained. “It’s tricky to get these things consistent between your local machine, the GPU in the office, or that remote cloud machine,” he said.
With multiple people competing for the same resources, scheduling can also become an issue. Yuval sees developers resorting to manually updating spreadsheets to schedule time on the GPUs used for deep learning applications.
The irony surely isn’t lost on the developers creating some of the world’s most advanced technology. “GPUs,” Yuval said, “are being managed by a glorified notepad.”
And then there’s the problem of collaboration.
The collaboration challenge
Yuval pointed to research by Burtch Works revealing that 17 percent of data scientists change companies every year. That makes continuity a real problem.
“Folks are going to leave your team,” Yuval said, which means data scientists need ways to hand off work to new people. Even those who remain at a company face challenges in sharing progress and collaborating so they can effectively build on the work of other team members.
Yuval talked to developers at one company where the process was to write down the results of their experiments on physical sheets of paper. But, he noted, that’s not an efficient way to run machine learning experiments.
“You can’t share these notes,” said Yuval. “No one can later pick them up. They’re basically garbage once they’re off your table.” But the biggest problem of all relates to the tools available for handling massive amounts of data.
The data challenge
Yuval discussed two main challenges when it comes to working with data for deep learning. First, the sheer amount of data makes it difficult to modify or move from one place to another.
“Maybe you have a ton of experiments that you want to do. Are you just going to duplicate your data all over the place?” Yuval asked. Given the hardware challenge, that’s generally not practical.
Second, adjustments to data skew the results of experiments. That makes it difficult to know what caused different results: updates to data or models. Version control is the answer to this problem, Yuval said, but many companies lack version control for their data.
The combination of hardware, collaboration, and data challenges slows down the development process and puts a strain on resources. But what if there were a way to apply the lessons of DevOps to machine learning?
Learning from DevOps
Today’s breakneck pace for software development — in which some companies push out new versions of their software hundreds of times daily — has developers moving much faster than in the past.
“There needs to be some way to innovate constantly, and to keep production stable,” Yuval explained. That methodology is DevOps, which coordinates development and operations to keep applications running even as they incorporate new code.
More than methods or tools, culture is the key to making DevOps work. And organizations can apply that culture to AI development.
“So, what is this culture?” he asked. “What are its core tenets?”
First is strict attention to version control. “This is the most basic building block of DevOps,” said Yuval. “You need to know who did what and why so you can have discussions around these changes.” After that comes a focus on testing, automation, and processes for monitoring everything needed to keep a project on track.
“These core tenets were basically what made DevOps what it is — a super-fast, very powerful way to manage what’s in development and what goes to production,” Yuval said. For developers, applying these principles to machine learning and deep learning begins with the right tools.
The DeepOps solution
Solving the challenges of hardware, collaboration, and data requires automation, Yuval said. An automated job queue can help data scientists efficiently share hardware. The system then determines which projects to schedule, when, and using what resources.
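To make the idea of an automated job queue concrete, here is a minimal Python sketch of how a shared pool of GPUs might be scheduled instead of tracked in a spreadsheet. It is an illustration only, not MissingLink's implementation: the names GpuJobQueue, submit, schedule, and release are hypothetical, and the policy (lowest priority number first, ties broken by submission order) is an assumption.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count


@dataclass(order=True)
class Job:
    priority: int                        # lower number is scheduled first
    seq: int                             # tie-breaker: submission order
    name: str = field(compare=False)
    gpus_needed: int = field(compare=False)


class GpuJobQueue:
    """Hypothetical in-memory queue that hands out a fixed pool of GPUs."""

    def __init__(self, total_gpus):
        self.free_gpus = total_gpus
        self._pending = []               # heap of Job objects
        self._seq = count()

    def submit(self, name, gpus_needed, priority=10):
        job = Job(priority, next(self._seq), name, gpus_needed)
        heapq.heappush(self._pending, job)

    def schedule(self):
        """Start every pending job that fits in the free GPUs, in priority order."""
        started, deferred = [], []
        while self._pending:
            job = heapq.heappop(self._pending)
            if job.gpus_needed <= self.free_gpus:
                self.free_gpus -= job.gpus_needed
                started.append(job.name)
            else:
                deferred.append(job)     # does not fit yet; keep it queued
        for job in deferred:
            heapq.heappush(self._pending, job)
        return started

    def release(self, gpus):
        """Return GPUs to the pool when a job finishes."""
        self.free_gpus += gpus
```

In this sketch, a small job may start ahead of a larger, higher-priority job that does not yet fit in the free GPUs. That keeps hardware busy, but a real scheduler would likely also need to guard against large jobs waiting indefinitely.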
In terms of collaboration, automation can help teams track and share information about their work.
Finally, automation can ensure version control is used for data management and processing. “The ideal solution would be a data lake,” Yuval said. In addition to applying version control, he added, the data lake should hold related files and stream them quickly whenever they are needed.
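One simple way to picture version control for data is to fingerprint a dataset and store that fingerprint next to each experiment's results, so a change in results can be traced to either the data or the model. The Python sketch below assumes the dataset is a directory of files on local disk. The function names dataset_fingerprint and record_experiment and the JSON Lines log format are hypothetical choices for illustration, not part of any MissingLink API, and this approach does not provide the storage or streaming a data lake would.

```python
import hashlib
import json
from pathlib import Path


def dataset_fingerprint(root):
    """Return a SHA-256 digest covering each file's relative path and contents."""
    root = Path(root)
    digest = hashlib.sha256()
    files = sorted(p for p in root.rglob("*") if p.is_file())
    for path in files:
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")             # separate the path from the contents
        with path.open("rb") as f:
            # Read in 1 MiB chunks so large files are not loaded into memory.
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
    return digest.hexdigest()


def record_experiment(log_path, data_dir, model_version, metrics):
    """Append one experiment record, tagged with data and model versions."""
    entry = {
        "data_version": dataset_fingerprint(data_dir),
        "model_version": model_version,  # e.g. a commit hash of the training code
        "metrics": metrics,
    }
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(entry, sort_keys=True) + "\n")
    return entry
```

If two runs share a model_version but differ in data_version, the data changed between them. The log file should live outside the data directory so that writing a record does not alter the fingerprint.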
MissingLink is solving some of these deep learning challenges. But Yuval cautioned the audience that the people doing the coding need to be part of the solution. “It’s not about shoving tools down anyone’s throats,” he said.
Looking ahead, Yuval and his team are working to spread the gospel of DeepOps. Their hope is that it will help keep data scientists happy doing what they do best, rather than trying to build their own infrastructure. “They’re going to do a lot more, a lot faster,” Yuval said.