Can Apache Spark “slip its earthly bounds and go serverless, clusterless”? Rose Toomey, a software engineer at Coatue, is coming to Scale By the Bay to give us an idea of what serverless Spark could look like and to get inspired by the future trends of Big Data.
In this interview, Rose spoke with us about her frustrations running Spark on servers, the effort to thrive in an ever-changing environment and why she calls her talk subject a "moonshot".
Welcome to Scale By the Bay! Please tell us more about yourself: how did you get interested in Big Data and what attracts you most in the space?
I’m a serial early-stage fintech employee, and I’m currently a Software Developer at Coatue Management. History may be written by the victors, but databases and APIs were written by good intentions. Getting value out of data is the highest friction and most interesting problem of our times.
What's the biggest challenge that you face in your work and how are you addressing the challenge?
I’m going to flip this question on its head: the biggest work challenge I face is actually within. It’s not just keeping up to date with such a large and rapidly changing field, but even more so freeing myself from the tyranny of what’s useful. It can feel like a luxury to wander out of the trenches and pursue approaches which might not have an immediate application. But the value of this exploration is so disproportionate to the time it takes! I’m forcing myself to lift my scope beyond the horizon of immediate delivery.
What's the biggest thing that is misunderstood about Spark and its impact on serverless?
Spark can feel like a beast from a different era. There are plenty of frustrations running a Spark job on servers even when EMR or Databricks do some of the heavy lifting. Early serverless implementations have been slow, high touch, and specialized. I’m not here to tell you it’s any different today. But astronauts went to the moon using technology with limitations we find inconceivable. The work needed to bring serverless Spark to fruition (developing alternative schedulers and non-disk-dependent shuffle services) yields improvements that can come back to benefit the whole ecosystem.
What are the three trends that will shape the future of Big Data?
All my impulses to answer this question – easier validation, better metadata stores, historically reproducible results – were so earthbound I can't even. So I'm coming to Scale By the Bay to be inspired with something more visionary.
What will you talk about at Scale By the Bay and why did you choose to cover this subject?
I’m talking about serverless Spark! It’s somewhat humorously called “moonshot” due to the remote and inhospitable environment for running Spark. This talk combines my interest in GraalVM with my interest, as a backend developer who has always worked on servers, in the nascent possibilities that serverless has to offer serious backend applications.
Who should attend your talk and what will they learn?
My talk is for anyone interested in how the Spark ecosystem will evolve over the next few years. It’s not about the why of finding a real production application. It’s about how serverless Spark could look, under the constraints of a serverless environment. We’ll also look at what advantages GraalVM could offer. Come prepared to experiment and edge past limitations: it’s all tremendous fun!
Anything else you'd like to add?
So excited about the presentations at this year’s conference! Bill Venners on semantic contracts, that’s a must-see. Also looking forward to the Quill + Doobie presentation and especially Shoumik Palkar on Weld. When Jon Pretty says “revolution”, I snap to and salute back. And I will be sorry to miss Evan Chan’s talk on Scala and Rust, which has been scheduled – tragically – opposite mine!