Updated: Sep 6, 2019
Anne DeCusatis started as a backend engineer with no prior knowledge of Scala and is now a Data Infrastructure Engineer at Spotify. They now work on a cross-functional team of six people solving data-quality-specific problems. Anne also co-founded MergeSort, a New York City feminist hackerspace supporting non-binary people and women in tech.
At Scale By the Bay, Anne, together with Idrees Khan, Senior Data Engineer at Spotify, will talk about the quality of data flowing through a pipeline and how Spotify’s infrastructure team focused on data quality addresses this problem by making cultural changes and building specific tools. In advance of the talk, we spoke to Anne about their Scala journey, Spotify’s autonomous culture, why more data can cause more problems, and the key trends that are shaping the future of Data and Data Pipelines.
Welcome to Scale By the Bay! Please tell us more about yourself: how did you get interested in Scala and Data Pipelines, and how did that lead you to your role at Spotify?
Over the three years that I spent at my previous job as a backend engineer, I went from knowing no Scala when I was hired to leading introductory Scala workshops for new hires. I eventually felt that I hit a plateau - I was still writing Scala on the backend every day, but not deepening my knowledge of it on a daily basis anymore. At the same time, I was also becoming more and more convinced that the problems that were being solved with data engineering were important and interesting problems, and I was curious about how the data engineers I worked with used data to help make better personalized experiences possible for users of the product I worked on. At Spotify, I’ve had the opportunity to grow my Scala expertise by working on infrastructure that impacts many teams, rather than the one feature at a time I was working on before. I’m also glad to have the chance to get my hands dirty with the practical aspects of using data at such a large scale.
What's your current role and what exciting things are you working on at the moment?
Currently, I’m a Data Infrastructure Engineer at Spotify. I work on a cross-functional team of about six people focused on solving data-quality-specific problems within a larger “data infrastructure” organization. Right now, we’re iterating on several tools internally, including validation and profiling tools that interoperate with data pipelines written with Scio. These are exciting for technical reasons, such as our use of Magnolia for typeclass derivation, but that’s not the only exciting thing about them.
What’s really important about these validation and profiling tools, and what makes me excited to work on them, is that they fill a need that data engineers at Spotify have. We also have ongoing maintenance work, on both internal projects and open source projects like Ratatool. “Maintenance work” sounds boring but I love having the chance to learn more about the problems people want our tools to solve and to help make that happen for them.
What's the biggest challenge that you face in your work and how are you addressing the challenge?
The culture at Spotify is different than at other places I’ve worked. We’re very autonomous and initiatives are often driven from the bottom-up (rather than top-down). I believe that everyone I work with has good intentions and wants to build the best product we can, and our different perspectives make us stronger. But sometimes it can be a challenge to get others to understand our perspective, or to understand theirs.
Currently, we’re working to increase adoption of our tools. An important exercise is understanding why someone would want to invest time in using our tools, or why they might need to focus on other product or maintenance work instead. We’re addressing this by trying to be good listeners! For one small example, we recently stopped posting GitHub pull request links in our public Slack channel, because we found that it became noise that drowned out questions that other teams had for us.
What's the biggest thing that is misunderstood about Data Pipelines?
I think that a misconception that people working with data often have is that collecting more data will help them solve their problems. More data can help solve problems, but it can also cause problems if you don’t handle it appropriately; for example, if it’s not clearly owned and well maintained, or if it contains users’ personal information and isn’t kept secure. When I first joined Spotify I worked on some of our GDPR compliance infrastructure, and so I’m very aware that owning large amounts of user data has both pros and cons, and you have to be careful with the data you collect.
What are the trends that will shape the future of Data and Data Pipelines?
Understanding your data is already important, and I think that will only become more true as time passes. AI is very trendy right now, and a machine learning model is only as good as the cleanliness and quality of the data that feeds it. Trends like users seeking control of the data that they produce in your ecosystem, and measuring the interpretability of the models that use your data as features (both for the consumers and producers of the models), matter if you want to build technologies that work well for the people using them.
What will you talk about at Scale By the Bay and why did you choose to cover this subject?
Our talk is called “How to Eliminate Surprises In Your Data.” I think that data quality is fundamentally important to building useful pipelines at scale, and it’s also what I work on every day. I wanted to share Spotify’s approach so that we could start a conversation about what other companies are doing with data quality initiatives, and where we as an industry can go in the future.
Who should attend your talk and what will they learn?
I don’t want anyone to be scared off from attending our talk, so I want to mention that even though I work primarily in Scala, we won’t be doing any deep dives into Scala internals.
If you operate data pipelines at scale and don’t already have infrastructure that’s specifically targeted at data quality, the talk can give you a sense of where to get started. We’ll also talk about the cultural changes we’re working on to prioritize data quality. If you work on data pipelines, you probably work with other people, and I want you to feel empowered by our talk to take ownership of the culture you’ve built around quality.
Anything else you'd like to add?
I’m also a member of the NYC steering committee for the LGBTQ+ employee resource group at Spotify. Speaking only for myself, one of the reasons I’m a data engineer is that I feel it’s important to consider the impact that your choices as an engineer will have on the people whose data you’re using, especially those from already marginalized communities. I won’t be talking about this on stage, but I would love to hear what people at Scale By the Bay are doing to be responsible stewards of data. Come find me in the hallway!