Putting the “data” in data scientist
Data is the fuel that powers the modern world. It has virtually limitless potential, from steering autonomous vehicles to helping medical researchers explore cures for cancer. Unfortunately, most data is useless without the ability to properly manage, interpret, and extract value from it.
Data scientists across the world face this challenge as they examine increasingly large and unruly datasets. Jesse Freeman, chief evangelist at MissingLink.ai, has been working with his colleagues to solve this dilemma with a deep learning tool that makes data management faster, simpler, and more scalable.
Jesse recently presented at the Open Data Science Conference in Boston, where he discussed some of the common problems data scientists face and how the MissingLink.ai platform aims to reduce and eliminate them.
“We’ve developed all these tools for new software development, and we want to bring that to scientists,” Jesse says. “We talk about putting the data in data sciences because we want to free them up instead of managing everything by hand – to automate it on a day-to-day basis.”
Tackling data science’s biggest challenge
Jesse illustrated the company’s mission using ChestXray14, one of the largest publicly available chest X-ray datasets. Released by the National Institutes of Health Clinical Center (NIHCC) in 2017, it includes more than 65,000 anonymized X-ray images from more than 30,000 patients, many of whom had advanced lung disease.
IDC predicts that the world’s data volume will grow from 33 zettabytes in 2018 to 175 zettabytes by 2025 – more than five times its current size. Jesse discussed the paradox of today’s data scientist: they have troves of invaluable data, but limited tools and resources to realize its full potential.
For example, ChestXray14’s data can be a starting point to train models to detect cancer and other chest diseases from X-ray images. One deep learning team used heat maps to detect pneumonia. However, on a broader scale, there are myriad complications on the path to accomplishing these goals.
“We talked to a hundred different AI companies, and one of the things we noticed was that if we look at what’s going on under the hood, it can be a tangled mess,” he says. “Sometimes the data scientist is doing all this by hand. Sometimes they don’t understand how to operate the machines, because it’s too hard or complicated to keep them configured. Sometimes they just don’t want to do commits every time you change one line of code.”
Using DeepOps to automate data analysis
To tackle these issues and continue the global tradition of democratizing technology for the masses, MissingLink.ai has conceived a set of practices and tools it calls “DeepOps.” Jesse described it as being like DevOps, but for the deep learning world.
“DeepOps, as we call it, is the core concept of what we’re trying to teach the deep learning community,” says Jesse. “How do we take the same tools and workflows that we use in DevOps when we automate for development and continuous integration? Things that allow us to write code, test it, and send it out and make sure that it’s valid before it goes out to production, without having to bother an engineer.”
DeepOps operates on a few core tenets. First, says Jesse, data must be collected, cleaned up, and labeled systematically. It must also be stored in a way that allows for proper versioning and selective manipulation of specific types and categories of data.
Next, the code must be managed to ensure that it’s always versioned and can be readily compared to previous experiments with no roadblocks in analyzing and understanding various results. Finally, the computer must manage all of this data and code automatically, to minimize the risk of human error and relieve data scientists of the cumbersome manual performance of these tasks.
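These tenets can be sketched in miniature. The snippet below is a hedged illustration of the idea, not MissingLink.ai's actual API: it fingerprints a dataset version with a content hash and pairs it with a code version, so an experiment record pins down exactly which data and code produced a result. The commit hash and record fields are invented for the example.

```python
import hashlib
import json

def dataset_fingerprint(records):
    """Hash a list of records so a dataset version can be pinned to an experiment."""
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def log_experiment(data_version, code_version, metrics):
    """Record which data and code produced which result (a stand-in for real tracking)."""
    return {"data": data_version, "code": code_version, "metrics": metrics}

records = [
    {"image": "xray_0001.png", "label": "pneumonia"},
    {"image": "xray_0002.png", "label": "normal"},
]

run = log_experiment(
    data_version=dataset_fingerprint(records),
    code_version="git:3f2a9c1",  # hypothetical commit hash
    metrics={"auc": 0.81},
)
```

Because the fingerprint depends only on the records' content, the same data always yields the same version string, and any relabeling produces a new one, which is the property that makes comparing experiments meaningful.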
Enabling data protection, evolution, and exploration
MissingLink.ai accomplishes these DeepOps tenets by focusing on three fundamental processes: data protection, evolution, and exploration. To ensure that private data stays in the right hands, the platform manages client data without touching or moving it from customer environments.
Data evolution functionality enables scientists to track how their data changes over time. For example, X-ray labels from different radiologists may differ, requiring updates to maintain consistency and avoid confusing the models during training. MissingLink.ai tracks these changes so that scientists can revert and test previous versions.
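The idea of tracking and reverting label changes can be illustrated with a toy version history. This is a sketch of the concept, not MissingLink.ai's implementation; the class and field names are invented.

```python
from collections import defaultdict

class LabelHistory:
    """Toy sketch of data evolution: keep every version of each image's label."""
    def __init__(self):
        self.versions = defaultdict(list)

    def relabel(self, image, label, radiologist):
        self.versions[image].append({"label": label, "by": radiologist})

    def current(self, image):
        return self.versions[image][-1]["label"]

    def revert(self, image):
        """Drop the latest label so an earlier training run can be reproduced."""
        if len(self.versions[image]) > 1:
            self.versions[image].pop()

history = LabelHistory()
history.relabel("xray_0001.png", "pneumonia", radiologist="A")
history.relabel("xray_0001.png", "normal", radiologist="B")  # radiologists disagree
history.revert("xray_0001.png")  # restore the earlier label for retraining
```

Keeping every version rather than overwriting labels in place is what lets a scientist test a model against the data exactly as it looked during a previous experiment.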
Moreover, the platform’s data exploration features help scientists perform advanced data slicing. They can, for instance, leverage deep learning algorithms to readily query and access select data without writing separate scripts, like analyzing X-ray images for female patients aged 18 to 55.
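A slice like the one above amounts to filtering on image metadata. As a hedged sketch (the metadata fields and helper are invented for illustration, not MissingLink.ai's schema), the same query could be expressed like this:

```python
# Hypothetical metadata records for X-ray images; field names are invented.
metadata = [
    {"image": "xray_0001.png", "sex": "F", "age": 34},
    {"image": "xray_0002.png", "sex": "M", "age": 61},
    {"image": "xray_0003.png", "sex": "F", "age": 19},
    {"image": "xray_0004.png", "sex": "F", "age": 72},
]

def slice_by(records, sex, age_min, age_max):
    """Select records matching a demographic slice, e.g. female patients aged 18-55."""
    return [r for r in records
            if r["sex"] == sex and age_min <= r["age"] <= age_max]

selected = slice_by(metadata, sex="F", age_min=18, age_max=55)
```

The value of building this into the platform is that the filter becomes a reusable, versionable query instead of a one-off script each scientist writes and maintains separately.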
MissingLink.ai also offers capabilities for syncing data, tracking metadata between syncs, and cloning datasets to a cloud or local machine. Iterators stream each X-ray image to either destination as needed, in contrast to the manual approach of processing all 45 gigabytes of images on a local computer.
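The streaming idea can be sketched with a plain Python generator: images are yielded a small batch at a time as the model consumes them, so the full 45 GB never has to sit in memory or on local disk at once. The `fetch_image` helper below is a placeholder for whatever actually retrieves each file.

```python
def fetch_image(name):
    """Placeholder for downloading or reading one image; returns bytes in a real system."""
    return f"<pixels of {name}>"

def stream_images(names, batch_size=2):
    """Yield images lazily in small batches instead of loading the whole dataset."""
    batch = []
    for name in names:
        batch.append(fetch_image(name))  # only the current batch is held in memory
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch

names = [f"xray_{i:04d}.png" for i in range(5)]
batches = list(stream_images(names))  # 3 batches: 2 + 2 + 1 images
```

A training loop would iterate over `stream_images(...)` directly rather than materializing `batches`, which is what keeps the memory footprint constant regardless of dataset size.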
Jesse and his team see no reason for data scientists to be held back by the bottlenecks of manual data management. MissingLink’s deep learning solution automates the tedious data organization and code development processes so that scientists can focus on the tasks that require a human touch.
After all, a developer’s time is better spent training AI on the nuances of identifying cancerous X-ray markings than on skimming code for bugs.