Back in 2000, Vishakha Gupta-Cledat took her father's advice to pursue a career in Computer Science. Now the Founder and CEO of ApertureData, she is striving to help companies navigate the complexities of managing visual data.
At Scale By the Bay, Vishakha will present ApertureData, a Data Management solution that redefines how large visual data sets are stored, searched and processed. In advance of her talk, we spoke with Vishakha about the biggest misconception about Machine Learning, the growth of visual data and why she decided to start her company.
Welcome to Scale By the Bay! Please tell us about yourself and ApertureData.
I am the Founder and CEO of ApertureData, the Data Management solution that redefines how large visual data sets are stored, searched and processed. Prior to founding ApertureData, I was at Intel Labs for over 7 years where I led the design and development of VDMS (the Visual Data Management System) which forms the core of our product, the ApertureData Platform.
I have a Ph.D. in Computer Science from the Georgia Institute of Technology and an M.S. in Information Networking from Carnegie Mellon University. My research interests encompass designing and building computer software systems with a focus on hardware/software co-design and implementation for solving big data problems. I have worked on graph-based storage and applications on non-volatile memory systems. I love working on systems that impose stringent requirements in terms of software design and coding and call for innovative solutions.
How did you get started in Computer Science and what was the turning point when you founded your own company?
I actually decided to go for a Bachelor's in Computer Science back in 2000 because my father had heard from someone that it was the new cool degree to get! Over time, I really came to enjoy all that I was learning, so much so that I decided to pursue a Master's at CMU and then a PhD at Georgia Tech, where my research focused primarily on virtualization technologies (which were up and coming at the time).
During my PhD, I collaborated closely with researchers from Intel, where I then started my first job in a group focused on virtualization technologies. I enjoyed working with people with a very wide breadth of knowledge (from hardware architectures to compilers to large-scale applications). Over time, as virtualization technologies matured, our lab’s focus shifted to infrastructure for end-to-end visual processing, and I led the design and development of the open source Visual Data Management System, which now forms the core of ApertureData’s product.
I founded ApertureData some time after I relocated from Portland to Seattle, when it became hard for me to sustain the continuous commuting and leaving my two daughters at home. The work we had done at Intel seemed to have a lot of potential and I wanted to take it further; we had some internal validation at Intel, and later on my conversations with other customers solidified my plans to put my vision for the product to the test. I got support (with some caveats) from my management chain at Intel and from my husband at home, and I launched ApertureData.
What exciting things are you working on at the moment?
Given that we are such a small team, as the Founder and CEO, I do whatever needs to be done. If a customer query doesn’t work, I debug it. If there is a long list of features to be built, I add some. If there is a new customer that could benefit from our Platform, I work on convincing them why they should absolutely give us a try. I interview people to hire, and I make sure that we are well-funded and secure for our future.
What's the biggest challenge that you face in your work and how are you addressing the challenge?
Our company is going to offer on-premises and SaaS versions of the ApertureData Platform, an end-to-end visual data management platform for scalable Machine Learning.
The market that we are targeting lies at the intersection of big-“visual”-data and Machine Learning. This market is poised to become very large very fast but it is only getting noticed now.
My biggest challenge is, therefore, to make people see how this is a challenge in their industry (even if they do not fully realize it yet). Most people realize they have a problem but haven’t fully circumscribed it and haven’t really thought of what tools are out there to solve it. Frequently, they deploy hacks on top of their current solutions.
To address this, I usually paint a picture of the typical workflow of one of their highly compensated data scientists or ML engineers when they are handed huge amounts of image/video files as well as the associated metadata (labels, origin, etc.). The amount of work required to utilize this data and train a model to extract business value is very large. Enterprises already understand the challenges of setting up teams of experts who can extract value from a data pile. By walking through the typical workflow from data to business value, they realize that such a team currently ends up spending months, for every new problem it is asked to solve, navigating the complexities of managing and engineering the visual data and metadata dumped by various visual sources before it can start turning that data into gold. Given the increasing scale of ML deployments, these data engineering efforts can end up costing corporations millions of dollars in lost productivity and efficiency.
Once that point is understood, it is relatively easy to help companies imagine the benefits of a data management platform that could unify the storage of data and its associated metadata through an AI-friendly and easy-to-use interface, making it possible for such teams to find what they need when they need it, and in the form they need it. It could save months at a time for these teams and millions of dollars for the enterprises.
What's the biggest thing that is misunderstood about Machine Learning and Visual Data Management?
About 80% of data on the internet is now visual (images or videos) and, with the development of medical imaging and self-driving cars for example, the amount of visual data is going to continue growing. Machine Learning, specifically Deep Learning, has seen unprecedented growth in the last decade, but most of the effort has gone into improving compute efficiency and the accuracy of recognizing content, not into data ingestion. As more and more enterprises adopt Deep Learning for understanding data and improving business efficiencies, managing large amounts of representative visual data now poses the next big challenge at every stage of their ML lifecycle, from collection and curation to extracting valuable business insights. The current misconception is that as long as ML algorithms gain accuracy and performance, data can be pipelined or its access costs can somehow be hidden.
We could not find a solution that offered as rich an interface and set of functionalities as the ApertureData Platform offers, as attested to by our current users. Any other solution required putting together an assortment of systems for metadata, image and video serving as well as preprocessing (e.g. OpenCV), and it was particularly hard to find a persistent index to perform similarity search on feature vectors. Even after putting the systems together, the API was unclear and any makeshift solution was typically not optimized for Machine Learning (e.g. pre-processing support). What other solutions fail to address is that once AI is ready to be commercialized, managing the onslaught of real visual data is going to be a killer for real deployments. Providing a simple interface so that people can continue with their AI and data science research will avert the data management crises which, frankly, a lot of companies are already starting to experience.
What are the three trends that will shape the future of the space?
The key trends that will shape the future of the space are the increasing adoption of Machine Learning across various application domains; the explosion in the amount of visual data collected with the intention of training ML, with an emphasis on more representative data; and advances in memory, storage, and networking technology to keep up with the vastly growing demands. The ApertureData Platform will benefit from each one of these trends given that we are a scalable data management platform designed for visual machine learning, and our Intel roots have already prepared us to exploit the benefits of new hardware technology.
What will you talk about at Scale By the Bay and why did you choose to cover this subject?
In my talk, I will present the motivations behind the ApertureData Platform as well as some details of its workings. I will use examples based on our customer interactions to explain the need for a unified interface that allows users to easily store visual data along with its associated metadata, enrich this metadata continually, and explore whether AI is ready to perform similarity searches or preprocess the data where it resides before it is ingested by a machine learning pipeline. I will demonstrate how our platform not only makes it easy to manage visual information but is also well-prepared for the stages of AI deployment and insights gathering that follow data collection, curation, and annotation.
I believe that this subject is apropos because the major focus of the ML community has been on compute, relegating data management to a second tier. Enterprises are now starting to realize that data management is a problem they can no longer ignore.
Who should attend your talk and what will they learn?
Data Scientists, Platform Engineers, and ML developers who want to use Deep Learning to extract insights from their collections of images and videos will enjoy the talk, particularly those at companies working on medical imaging AI, autonomous driving, satellite imaging, entertainment, industrial manufacturing, and so on.
If you are interested in or working in one of these categories, I invite you to come attend the talk. You will take away a realization that data management can be made efficient and intuitive, regardless of the scale. I also hope to demonstrate that improvements in Machine Learning won’t just come from the compute pipeline but will also stem from how the data is managed.
Anything else you'd like to add?
We are currently hiring. If you are passionate about solving the scaling problem for big-“visual”-data and enjoy diving deep into setting up and optimizing deep learning pipelines for visual data, we’d love to hear from you.
Don't miss Vishakha Gupta-Cledat and her talk Machine Learning’s Missed Opportunity in Visual Data Management at Scale By The Bay this November. Book your spot today.