In the ever-evolving realm of data science, few names resonate as profoundly as Fabiana Clemente. As the co-founder of YData, Fabiana has been at the forefront of pioneering change, pushing boundaries, and redefining the paradigms of data-centric AI. With a rich tapestry of experience encompassing Data Profiling, Causality, and Privacy, she has been instrumental in bridging the gap between data's potential and its actionable utility for organizations. Today, as AI solutions like ChatGPT and Midjourney become household names, we sit down with Fabiana to delve deeper into the transformative journey of YData, the essence of Data-Centric AI, and the burgeoning significance of synthetic data. Join us as we embark on this enlightening conversation with one of the luminaries of the data science world, only on 'Speaker by the Bay'.
Now, let's explore Fabiana's thought-provoking responses to our contemporary questions.
This conversation promises to be a beacon of knowledge for data scientists, AI aficionados, tech visionaries, and all those eager to grasp the future direction of data-centric solutions and the indispensable role of pristine data in sculpting the AI of tomorrow.
1. YData was founded in 2019 when a broader audience couldn't think of using AI products like ChatGPT or Midjourney daily. Recently, things have shifted tremendously. How did it affect your company? What doors did it open that seemed to be locked?
ChatGPT and similar solutions have brought the potential and capabilities of AI to the wider public's attention. They have changed much of the AI landscape, with new solutions required and new applications developed. They have also sparked conversations around the data used to train these models, and around how data quality and data preparation matter for the development of successful AI solutions - which, in the case of YData, highlighted the relevance of adopting solutions like Fabric. YData Fabric is a data development platform with automated data quality profiling, synthetic data, and scalable data workflows for fast data iteration and improvement, which positions us in the right place for what's coming in the AI landscape.
2. Your talk title is "Unlocking the Power of Data-Centric AI: Mastering Data Preparation for Machine Learning Success." Could you please give us your definition of Data-Centric AI? How is ML adopting this new paradigm?
Let's imagine this - if we build a house on a shaky foundation, we know it's destined to crumble. Similarly, if we feed our AI models low-quality or biased data, we are setting them up for failure. So no, even if we all use the same model, we are not guaranteed the same results - and even LLMs won't be able to solve it all (at least not yet!). The Data-Centric AI concept lies in a shift of focus away from making models bigger and more complex, towards something equally, if not more, critical: the improvement and iteration of data quality. In the context of ML, models such as the LLMs we have been leveraging over the last years, and perhaps even decades, as remarkable as they are, are not self-sufficient. They thrive on the value delivered by the data they've been trained on. Every sentence they generate, every question they answer, is a reflection of the information they've absorbed. And that's where data-centric AI steps in - instead of treating data preparation as a single step in time, it is seen as an iterative process that needs to be tuned and validated at each and every step.
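The iterative loop described here - profile the data, improve it, and re-validate - can be sketched in a few lines. This is a minimal illustration with a hypothetical quality metric (the fraction of missing cells) and a hypothetical improvement step (mean imputation); it is not YData Fabric's actual pipeline:

```python
def missing_fraction(rows):
    """Quality metric: share of missing (None) cells in the dataset."""
    cells = [v for row in rows for v in row]
    return sum(v is None for v in cells) / len(cells)

def impute_pass(rows):
    """One improvement step: replace None with that column's mean."""
    cols = list(zip(*rows))
    means = [
        sum(v for v in col if v is not None)
        / max(1, sum(v is not None for v in col))
        for col in cols
    ]
    return [
        [means[i] if v is None else v for i, v in enumerate(row)]
        for row in rows
    ]

# Toy dataset with gaps: validate, improve, then re-validate.
data = [[1.0, None], [2.0, 4.0], [None, 6.0]]
while missing_fraction(data) > 0.0:
    data = impute_pass(data)
```

Real data-centric workflows swap in richer profiling metrics (bias, drift, label noise) and richer improvement steps, but the validate-improve-revalidate loop stays the same.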
3. In your talk, you cover synthetic data, which seems to be a popular concept now. Correct us if we are wrong; it is widely used in tests, so what's the problem with using authentic production data in a test environment? Why is data masking not enough?
Rather than being collected from real sources (e.g., “real data”), synthetic data is artificially generated by a computer algorithm. Synthetic data is not “real” in the sense that it does not correspond to actual activities, but if it is generated in a data-driven way (i.e., attending to the properties of the original data), it holds real data value, mimicking real behavior and providing the same insights.
There are multiple applications for synthetic data. For situations where obtaining additional real-world data is challenging, costly, or ethically questionable, synthetic data offers an alternative, leading to cost and time savings for organizations. Moreover, it introduces an enhanced layer of privacy and security, simplifying regulatory compliance challenges and facilitating data integration from different sources. It's especially beneficial in the realm of artificial intelligence. Developers can produce vast volumes of data tailored to specific requirements, ensuring that AI and machine learning models are trained comprehensively. Beyond its broader applications, synthetic data presents a compelling approach to privacy, especially in the context of data sharing. Since synthetic records aren't directly mapped to real-world data, the likelihood of re-identifying individuals diminishes, offering a more effective privacy strategy.
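As a toy illustration of "data-driven" generation, the sketch below fits only the per-column mean and standard deviation of a small numeric dataset and then samples new rows from independent Gaussians. Real synthetic data tools model the joint structure of the data with far richer generative models; this is just a hedged sketch of the core idea, with made-up example values:

```python
import random
import statistics

def fit_marginals(rows):
    """Estimate per-column mean and stdev from the original data."""
    cols = list(zip(*rows))
    return [(statistics.mean(c), statistics.stdev(c)) for c in cols]

def sample_synthetic(params, n, seed=0):
    """Draw synthetic rows from independent Gaussians fitted above."""
    rng = random.Random(seed)
    return [[rng.gauss(mu, sd) for mu, sd in params] for _ in range(n)]

# Hypothetical "real" data: [height_cm, weight_kg] per person.
real = [[170.0, 65.0], [180.0, 80.0], [165.0, 60.0], [175.0, 72.0]]
params = fit_marginals(real)
synthetic = sample_synthetic(params, 1000)
```

None of the 1,000 generated rows corresponds to an actual person, yet aggregate statistics of the synthetic sample track those of the original - which is the sense in which synthetic data "holds real data value" while reducing re-identification risk.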
Our conversation with Fabiana Clemente has shed light on the indispensable role of data in shaping the AI frontier. As we navigate the dawn of a data-centric era, the insights shared today are more relevant than ever. For a deeper exploration into the world of data-centric AI and its transformative potential, we warmly invite you to the SBTB conference. There, Fabiana and her co-founder, Gonçalo Martins Ribeiro, will unravel the complexities of data preparation and its profound impact on machine learning.
Be part of this enlightening session and witness the future of AI unfold.
P.S. Discover inspiration in Fabiana's captivating intro video: