Sherin Thomas is a Software Engineer with over 12 years of experience at companies like Google, Twitter, Lyft, Netflix, and Chime. She works in the fields of Big Data, Streaming, ML/AI, and Distributed Systems. Currently, she's building a shiny new data platform at Chime. Sherin has presented on ML and Streaming at various reputable conferences, including a keynote address, and has judged various awards such as the SXSW Innovation Awards and CES.
Recently, she advised NASA's SpaceML program and helped build a platform for processing petabytes of satellite imagery to detect weather patterns and label raw data for climate-science-related AI research. She also writes a blog where she shares her thoughts on technology, work, and career.
When she's not doing technical work, she enjoys painting, reading, perusing the art and fashion sections of The New York Times, and spending time with her husband and toddler.
Recipe for Building a Discoverable and Governed Data Platform
Discovery is the first barrier to using data. As data platforms and systems scale, so does the ability of stakeholders to create more and more data, and more data means things are harder to find. Data products need to be cataloged on the go, not only for discovery but also for governance. Moreover, data can exist in many forms (reports, tables, files, streams, services, logs) and may go through multiple hops of processing by multiple teams before it becomes a curated data product. Furthermore, a typical data organization will have a plethora of platform and infrastructure pieces: some open source, some cloud based, and some custom. To build a robust discovery ecosystem, cataloging must happen continuously, at each hop and for every component in the organization. This becomes challenging without a central team overseeing the entire process.
In this presentation, I will talk about how we solved the problem of cataloging and discovery using DataHub as our discovery platform. I will cover the details of how we went about ingesting metadata from a wide variety of infrastructure and platform components (such as Snowflake, Looker, Terraform, Airflow, Kinesis, and custom declarative configs) that are involved in a typical data product lifecycle at Chime. I will also talk about the processes and design principles we used to make cataloging and data governance a part of our DNA.
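To make the idea of cataloging "at each hop and for every component" concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: it is not the implementation used at Chime and it does not use DataHub's API. Every name in it (`MetadataRecord`, `Catalog`, the three example sources, and the dataset names) is hypothetical. It shows several independent sources pushing metadata into one catalog, which can then answer a discovery query, a simple governance check, and an upstream lineage query.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class MetadataRecord:
    """One cataloged asset. All fields are hypothetical, for illustration."""
    platform: str                    # e.g. "warehouse", "stream", "dashboard"
    name: str                        # asset name within that platform
    owner: str = ""                  # empty string means no owner recorded
    upstreams: Tuple[str, ...] = ()  # keys of the assets this one is built from

    @property
    def key(self) -> str:
        return f"{self.platform}:{self.name}"


# A source is any callable that yields records for one platform component.
Source = Callable[[], Iterable[MetadataRecord]]


class Catalog:
    """In-memory stand-in for a central discovery platform."""

    def __init__(self) -> None:
        self._records: Dict[str, MetadataRecord] = {}

    def ingest(self, source: Source) -> int:
        """Upsert every record from a source; return how many were read."""
        count = 0
        for record in source():
            self._records[record.key] = record
            count += 1
        return count

    def search(self, term: str) -> List[str]:
        """Discovery: keys containing the term, case-insensitively."""
        term = term.lower()
        return sorted(k for k in self._records if term in k.lower())

    def missing_owner(self) -> List[str]:
        """Governance check: assets with no recorded owner."""
        return sorted(k for k, r in self._records.items() if not r.owner)

    def lineage(self, key: str) -> List[str]:
        """All transitive upstream keys of the given asset."""
        seen: List[str] = []
        record = self._records.get(key)
        stack = list(record.upstreams) if record else []
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.append(current)
            if current in self._records:
                stack.extend(self._records[current].upstreams)
        return sorted(seen)


# Hypothetical sources, one per hop in a data product's lifecycle.
def stream_source() -> List[MetadataRecord]:
    return [MetadataRecord("stream", "payments.events")]


def warehouse_source() -> List[MetadataRecord]:
    return [MetadataRecord("warehouse", "payments.daily", owner="data-eng",
                           upstreams=("stream:payments.events",))]


def dashboard_source() -> List[MetadataRecord]:
    return [MetadataRecord("dashboard", "payments_overview", owner="analytics",
                           upstreams=("warehouse:payments.daily",))]


catalog = Catalog()
for source in (stream_source, warehouse_source, dashboard_source):
    catalog.ingest(source)

print(catalog.search("payments"))
print(catalog.missing_owner())
print(catalog.lineage("dashboard:payments_overview"))
```

In this toy setup the stream has no owner, so the governance check flags it, and the lineage query for the dashboard walks back through the warehouse table to the stream. A real deployment would replace the in-memory dictionary with the discovery platform and the hand-written sources with connectors for each component.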