Airbyte, our Gold sponsor, emerges as a dazzling star, illuminating the path for those navigating the complex corridors of data integration. Today, dear readers, we're not just sharing an interview; we're inviting you on an odyssey. An odyssey with Sherif Nada, the mastermind behind Airbyte's engineering marvels, and Ben Church (check out his talk), the software sorcerer who crafts magic with every line of code. So, grab your virtual telescopes, and let's embark on this celestial journey through the galaxies of data, AI, and the future!
1. Sherif, please tell us about Airbyte and your role there.
I'm an engineering manager at Airbyte, where I lead our engineering teams working on API connectors, connector developer tooling, and connector portability efforts.
2. The main theme of our conference this year is "The Code and Data in the Age of AI". You are an expert in data integration & management with years of experience working with high-load systems. AI might seem to be an emerging technology that requires new solutions for everything. While that is true in some areas, we'd like to know: what lessons have you learned along the way that can and should be applied when building effective AI implementations?
A large number of AI engineering problems are actually data engineering problems. This is good news! It means that AI engineers can reuse many of the existing solutions in the space instead of inventing them from scratch.
3. How is data management different for AI systems? What are the top challenges AI engineers must solve to move data around?
Whereas the paradigm has mostly moved toward ELT in the last few years, AI brings us right back to ETL. For example, most AI RAG applications require data to be non-trivially transformed before it is loaded into a vector store for use in the AI app. This means the T must run before the L, so existing ELT solutions need to support the new (but old) paradigm. This could include offering the Extractors and Loaders as embeddable code components, or establishing a standard pattern where the EL platform delivers all the data deduplicated and ready for consumption by AI developers building RAG apps.
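The T-before-L pattern Sherif describes can be sketched in a few lines. Everything here is a toy stand-in: `extract()` mimics a connector pulling raw records, `embed()` is a fake character-frequency embedding rather than a real model, and the "vector store" is just a Python list.

```python
def extract():
    # Stand-in for an EL connector pulling raw records from a source.
    return ["Airbyte moves data between systems. It has many connectors."]

def transform(records, chunk_size=40):
    # The "T": split each record into chunks sized for the embedding model.
    chunks = []
    for record in records:
        for i in range(0, len(record), chunk_size):
            chunks.append(record[i:i + chunk_size])
    return chunks

def embed(text):
    # Toy embedding: a 26-dim character-frequency vector.
    # A real app would call an embedding model here.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def load(chunks, store):
    # The "L": write (vector, chunk) pairs into the vector store,
    # which happens *after* the transform step.
    for chunk in chunks:
        store.append((embed(chunk), chunk))

store = []
load(transform(extract()), store)
```

The point is the ordering: the chunking and embedding transform sits between extract and load, which is exactly what classic ELT pipelines are not built to do.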
4. Looking into the future, 5-10 years from now, what's your prediction on these challenges and solutions? What looks promising?
My hot take is that vector databases being separate from "normal" databases is a big detriment to the developer experience, due to the added infra management overhead and the need to keep data reconciled between two different silos. I think OLAP and OLTP databases will attempt to take on that functionality from vector databases and potentially make them obsolete. It remains to be seen whether they will succeed!
In the digital tapestry of 2023, some figures weave patterns of brilliance. Ben Church is one such figure. A maestro in software engineering and a stalwart in the open-source community, today we delve into the insights and passions of this Senior Software Engineer at Airbyte. Join us as we navigate the intricate pathways of AI, data embeddings, and the future of open-source with Ben.
1. In the bustling digital landscape of 2023, with LLMs and AI taking center stage, Vector Embeddings have emerged as pivotal tools for personalizing and contextualizing information. As an engineer, when do you believe it's most appropriate to leverage embeddings, and what primary considerations or constraints should be prioritized in their application?
Embeddings serve as numerical fingerprints for various data types, from text and images to audio clips. Essentially, they're vectors or arrays of weights that distill the core characteristics of any given data. This allows us to discern how similar or distinct two pieces of data might be when processed by a machine.
In real-world applications, the usage of embeddings often involves multiple steps. Let's take an example. If someone queries, "What is the biggest tree in the world?", the first step would be to convert this query into a vector using an LLM. Then, this vector is used to search for similar documents or records in a datastore where vectors associated with relevant documents have been saved. After this search, you'll retrieve documents that are contextually close to the query, such as articles or lists about large trees worldwide. The final step involves taking these retrieved documents and combining them with the original query to form a more enriched context. This composite data is then sent back to the LLM, in the hope that with this enhanced context, the model can pinpoint a more accurate answer.
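The steps Ben walks through can be sketched end to end. Note the heavy hedging: `embed()` here is a toy bag-of-words vector over a fixed vocabulary (a real application would call an LLM embedding endpoint), and the two documents are invented for illustration.

```python
import math

documents = [
    "General Sherman is the largest known living tree by volume",
    "A blue whale is the largest animal on Earth",
]

# Toy embedding: bag-of-words counts over a fixed vocabulary.
# A real app would call an embedding model instead.
vocab = sorted({w for doc in documents for w in doc.lower().split()})

def embed(text):
    words = text.lower().split()
    return [float(words.count(w)) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Step 1: convert the query into a vector.
query = "What is the biggest tree in the world"
query_vec = embed(query)

# Step 2: the datastore holds (vector, document) pairs saved ahead of time.
datastore = [(embed(doc), doc) for doc in documents]

# Step 3: retrieve the document contextually closest to the query.
best_vec, best_doc = max(datastore, key=lambda p: cosine(query_vec, p[0]))

# Step 4: combine the retrieved context with the original query
# into an enriched prompt to send back to the LLM.
prompt = f"Context: {best_doc}\n\nQuestion: {query}"
```

With this toy scoring the tree document ranks closest, so the enriched prompt carries the relevant context alongside the original question.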
However, these systems have their own complexities and constraints. The primary challenge arises from the token limit imposed by most language models. Every model has a maximum number of tokens it can process in one go; ChatGPT, for example, can only handle approximately 4,000 tokens in a single query. Given this constraint, there's a limit on how many embeddings or pieces of data you can feed into an LLM. This directly impacts how you design your datastore: the size and type of data you decide to convert into vectors, and how you prioritize or truncate information during subsequent LLM interactions.
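One common way to respect that token limit is to budget tokens before assembling the prompt. This is a sketch under stated assumptions: it uses a rough four-characters-per-token heuristic (a real app would use the model's actual tokenizer, e.g. tiktoken for OpenAI models), the 4,000-token default echoes the figure above, and `reserve_for_answer` is a hypothetical knob.

```python
def fit_context(chunks, question, max_tokens=4000, reserve_for_answer=500):
    """Greedily pack relevance-ranked chunks into the remaining
    token budget; stop at the first chunk that no longer fits."""

    def estimate_tokens(text):
        # Crude heuristic: ~4 characters per token. Swap in the
        # model's real tokenizer for production use.
        return len(text) // 4 + 1

    budget = max_tokens - reserve_for_answer - estimate_tokens(question)
    selected = []
    for chunk in chunks:  # chunks assumed pre-sorted by relevance
        cost = estimate_tokens(chunk)
        if cost > budget:
            break
        selected.append(chunk)
        budget -= cost
    return selected

# Example: three retrieved chunks of growing size; the third one
# blows the budget and is dropped.
chunks = ["a" * 4000, "b" * 8000, "c" * 20000]
selected = fit_context(chunks, "What is the biggest tree?")
```

Dropping whole chunks is only one strategy; truncating the last chunk or summarizing overflow are common alternatives, but the budgeting step is the same.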
Consequently, while embeddings offer a robust mechanism to distill, compare, and retrieve information, their design and application require careful consideration. It's essential to understand the underlying model's limitations and strategize data storage and retrieval processes accordingly.
2. In your talk, you will provide a solution using a vector database. Why would one want to use a vector database to store embeddings? What's the difference between a vector index and a vector database?
A vector database is specifically designed to store and manage high-dimensional vectors like embeddings. Unlike traditional databases that deal with structured data types (like integers, text, or dates), vector databases are optimized for storing, querying, and retrieving vectors efficiently. This efficiency is crucial, given the immense computational resources that could be consumed when dealing with millions or even billions of embeddings.
One of the primary reasons to use a vector database is its capability to perform fast similarity searches. When you query with a particular vector, the database can rapidly identify and retrieve vectors that are 'closest' to the queried vector, based on a certain distance metric (like cosine similarity or Euclidean distance).
Now, addressing the difference between a vector index and a vector database:
A vector index is essentially a data structure designed for the efficient organization and access of vectors based on their content, enabling fast similarity searches. A vector database is a datastore focused on efficiently storing and utilizing data by means of vector indexes.
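To make the distinction concrete, here is a deliberately naive vector index: just a data structure with `add` and `search`. Production indexes (HNSW, IVF, and similar) use approximate nearest-neighbor structures instead of the full scan shown here, and a vector *database* layers persistence, metadata filtering, and an API on top of index structures like this one. The class and its example vectors are illustrative, not any particular product's API.

```python
import math

class VectorIndex:
    """Brute-force vector index: organizes (vector, payload) pairs
    for similarity search. Real indexes avoid scanning every vector."""

    def __init__(self):
        self._items = []  # (vector, payload) pairs

    def add(self, vector, payload):
        self._items.append((vector, payload))

    def search(self, query, k=1):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / (na * nb) if na and nb else 0.0

        # Rank every stored vector by similarity to the query.
        ranked = sorted(self._items,
                        key=lambda item: cosine(query, item[0]),
                        reverse=True)
        return [payload for _, payload in ranked[:k]]

index = VectorIndex()
index.add([1.0, 0.0], "doc-a")
index.add([0.0, 1.0], "doc-b")
index.add([0.9, 0.1], "doc-c")
results = index.search([1.0, 0.2], k=2)
```

Everything a vector database adds beyond this, e.g. durable storage, filtering, and transformation features, is what Ben describes next.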
Though this is just today. Looking into the future, the definition and utility of a vector database are set to expand. LLMs have certainly put a focus on the utility of vectors (embeddings), and as a result have expanded how and when they are used. Many vector databases are moving beyond basic similarity searching and beginning to concern themselves with transforming data into vectors, either at the object level or the query level, abstracting away the need for users to write the code that calls the model's embedding function. I can only imagine these specialized features will grow as we continue to discover the best practices for building systems and products around LLMs and embeddings.
As a result, when designing these systems it's worthwhile to consider whether you should use a vector database, not just for speed improvements in your similarity searches, but to avoid having to invent and/or reinvent tooling as the space inevitably changes.
3. Ben, you are heavily involved in OSS. If it was outside your job responsibilities, how much time would you spend on OSS? Why?
Initially, diving into OSS for me was part exploration and part frustration. There were moments when tech felt more complex than it needed to be. Whenever that happened, I felt the pull to create the tutorial or library that I needed but never found, so that the next developer could focus on building their product, not learning some esoteric protocol.
Though along the way, that frustration gave way to an enjoyment of teaching: of making complex topics simpler, and product cycles faster.
Before joining Airbyte, I was in deep with communities like Elixir, React, and GraphQL. They shaped how I see open-source. Now, even though I'm with Airbyte, my dedication to the world of OSS is its own thing. I don't see it changing based on my job, but yeah, family priorities might shuffle things a bit.
4. Projects like co-pilot and ChatGPT use OSS without crediting the OSS creators. What are your thoughts on this?
Speaking personally, and not for Airbyte here.
Finding a balance in acknowledging OSS in the context of model training is far from easy. There are far too many grey areas. It's clear that OSS licenses should be respected and creators should have a say in how their IP is used. However, I'm not certain any commonly used license today has the appropriate language for specifying whether source code may be used in model training. Then there is the question of what constitutes IP. The PostgreSQL codebase would likely meet any definition, but what about those one-off experiments I have on GitHub from university? Or the first time "if x = 5:" was written? In any case, I certainly don't have the answer to that.
And as our cosmic conversation draws to a close, it's evident that the universe of data integration and AI is vast, mysterious, and full of wonders. Airbyte, with its constellation of solutions, is not just a beacon in this universe; it's a guiding star. Sherif and Ben, with their profound insights, have not only shared the present landscape but have also charted the course for the future. As we continue our digital voyage, it's pioneers like them and trailblazing companies like Airbyte that will lead the way, ensuring that the future is not just about bytes and bits, but about dreams, visions, and infinite possibilities.
As we look to the horizon, we're thrilled to invite all of you to join us at the SBTB conference, where we'll further explore these digital frontiers and rock the world of data and AI. So, as we sign off, remember, in the world of data and AI, the sky isn't the limit; it's just the beginning. Until our next digital adventure, keep exploring, keep innovating, and keep reaching for the stars!