Nick is a Senior Research Engineer focused on scalable serving of large language models. He previously led the architecture and development of distributed machine learning infrastructure supporting key IBM AI cloud products and services. He designed and implemented the Model-Mesh serving framework that supports hundreds of thousands of models, now a key component of the KServe open source project.
Efficiently serving LLMs at scale.
In this talk I will discuss the challenges of serving LLMs efficiently in highly concurrent, multi-user contexts, and some of the optimizations unique to these kinds of models that have emerged over the last year. In particular, I'll explain "continuous batching" of heterogeneous requests.