Updated: Nov 7, 2019
Previously a Principal Engineer and Director at Facebook, where he founded the Product Infrastructure team and co-created GraphQL, Nick Schrock is the founder of Elementl, a company building the next generation of open-source data tools.
In advance of Nick's talk at Scale By the Bay, where he will present Dagster, a framework for building modern data applications, we caught up with Nick and discussed his career at Facebook, starting Elementl, developing the open-source project Dagster, and the future of data and data infrastructure.
How did you get interested in data and what are the key highlights of your career?
The bulk of my career was spent at Facebook on the Product Infrastructure team, where I was a founding member. We did not deal with data management broadly - ETL, ML, and the like. Our job was to make Facebook product developers more efficient and productive. Our projects were initially for internal consumption only, but we eventually spun some of them out into open-source projects. Examples include React - which I did not directly contribute to - and GraphQL, which I co-created.
I left Facebook in early 2017 and while figuring out what to do next, I asked many companies - in Silicon Valley and more traditional companies - what their biggest technology issues were. Data management, data science, and data engineering kept coming up as a huge technical liability.
I began to investigate the field and found pockets of genius and greatness, but also a field that felt like it was a decade behind in terms of development tools and practices. Testing is not a norm; modern software development practices are not used; the developer experience and workflows are quite broken; the systems are too fragmented and siloed; and so on. This leads to a general sense that folks are doing repetitive work and building repetitive infrastructure.
You worked as a Principal Engineer and Director at Facebook for many years. What are your key learnings from that experience?
There are so many lessons to share. Pete Hunt (another ex-Facebooker) and I put together a 20-episode podcast series with Jeff Meyerson from Software Engineering Daily about this very subject. Here's a recently published episode that tries to sum up the experience.
You are the founder at Elementl. Can you please tell us more about Elementl and what you are working on?
Elementl’s first project, Dagster, is an open-source Python framework for building data applications, a category that includes ML processes, ETL pipelines, and more. With Dagster you can define your data application as a graph of functional computations; annotate those computations so they are self-describing in tooling and more testable; deploy and schedule them; and develop and monitor them using beautiful, modern tooling. You can use computational runtimes such as Spark, Pandas, and Jupyter Notebooks and deploy to arbitrary infrastructure.
Our long-term vision is to build a commercial, hosted data management and developer platform that leverages the adoption of Dagster in the broader ecosystem.
What are your thoughts on Open Source and why is it important to you?
I believe Open Source is the right model for building developer tools in today’s world, especially frameworks that have an intimate relationship with a developer’s code and workflow. What do I mean by intimate? For example, a framework that co-exists in the same process, that calls into your code within that process, that is one of the primary API surfaces for work, and that is a defining element in your daily workflow. GraphQL and React fall into this category. Dagster does as well.
How did the idea of Dagster come about?
The core insight that spawned Dagster came from ruminating on the question “what are the unique properties in the data domain?” The question arose from noting that if you take a developer and move her from a traditional application to a data pipeline, her behavior changes a lot. The context and constraints are totally different.
Unlike in traditional apps, there is a one-to-one correlation between a data asset and the computation that produces it. Meaning that if you have a file in a data lake or a partitioned table in a data warehouse, in nearly all cases a single computation - or graph of computations - has produced that asset over time.
What if we focused more on the computation that produced the asset rather than the asset itself? Give the computation a queryable type and metadata system. Standardize the way it is configured, invoked, and monitored, and put all of those capabilities behind a well-structured API. Doing so makes your users more efficient and productive. Once you put a real identity around the computation, it becomes more powerful than the identity of the data asset itself, because the data assets are derived from the computation.
So that was the original insight: let’s take the construction of this computation seriously. It’s not just a random, unstructured script of Python stitched into a graph late in the process. The DAG should be front and center; it should be testable, describe itself, and be operable and configurable over an API and rich tooling.
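The idea of treating the DAG as a first-class, self-describing, testable object can be sketched in plain Python. This is a conceptual illustration only, under assumed names: the `Computation` class and `execute` helper below are hypothetical, not Dagster's actual API.

```python
from typing import Callable, Dict, List

class Computation:
    """A self-describing node in a data DAG: a named function plus the
    names of the upstream computations whose outputs it consumes."""
    def __init__(self, name: str, fn: Callable, deps: List[str],
                 description: str = ""):
        self.name = name
        self.fn = fn
        self.deps = deps
        self.description = description  # queryable metadata, not a comment

def execute(dag: Dict[str, Computation]) -> Dict[str, object]:
    """Run every computation in dependency order and return all results,
    so each produced asset stays traceable to the computation that made it."""
    results: Dict[str, object] = {}

    def run(name: str):
        if name not in results:
            node = dag[name]
            inputs = [run(dep) for dep in node.deps]  # resolve upstream first
            results[name] = node.fn(*inputs)
        return results[name]

    for name in dag:
        run(name)
    return results

# A toy three-node pipeline: extract -> clean -> count.
dag = {
    "extract": Computation("extract", lambda: [" a", "b ", " a "], [],
                           "Load raw records"),
    "clean": Computation("clean", lambda rows: [r.strip() for r in rows],
                         ["extract"], "Normalize whitespace"),
    "count": Computation("count", lambda rows: len(set(rows)),
                         ["clean"], "Count distinct values"),
}
results = execute(dag)
print(results["count"])  # -> 2 distinct values ("a" and "b")
```

Because each node is an object rather than a line in a script, a test can invoke any single computation in isolation (e.g. assert `clean` strips whitespace) or execute the whole graph and inspect every intermediate result.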
What are the key challenges you are facing in your work and in developing Dagster? How are you addressing these challenges?
When building something new and interesting, the biggest challenge is 1) delivering real value incrementally and practically in the short term without compromising the long-term vision, and 2) communicating the value proposition of a fundamentally new thing clearly to users. This is challenging because you are on that journey of exploration as well, and the software will inevitably change, so you need to find the right early users who also align with the vision.
We are addressing this challenge by improving our software, docs, and messaging, and then using forums like this and podcasts to communicate our principles and values to potential users directly.
What do you think the future of data and data infrastructure will look like? What will transform it the most?
It’s hard to say because there is such a long way to go. It’s a fractured ecosystem with so many moving parts, and they are rapidly evolving. It’s also a critically important ecosystem. Huge decisions - both automated and human - are made based on the data and insights produced by these systems, and we are in the early days of making them reliable and understandable.
Who should attend your talk at Scale By the Bay?
My talk will be relevant to anyone building or deploying ETL or ML systems of even moderate complexity. Anyone building such systems quickly finds they need tools to better build, structure, and test those applications, and those tools are wanting in today’s ecosystem. We think Dagster fills a real need for that type of builder.
Don't miss Nick Schrock and his talk Dagster: a Framework for Data Processing Applications at Scale By the Bay in Oakland on November 14th. Book your ticket now.