ASPLOS’25 Tutorial - AIBrix: An Open Source Large-Scale LLM Inference Infrastructure for System Research
Abstract
The rapid advancements in large language model (LLM) inference have spurred numerous innovations. Our findings suggest that for LLMs to be effectively deployed in production environments, optimization at the engine level alone is insufficient. Successful production deployment demands a holistic approach that integrates optimizations across three key layers: the model layer, the engine layer, and the system layer.
AIBrix is an open-source, system-level solution designed to address the complexities of LLM inference in production environments. It provides a seamless platform for transforming large models into scalable APIs, focusing on critical system aspects such as LLM-specific autoscaling strategies, model-locality-aware scheduling, cost-efficient management of heterogeneous hardware, and efficient colocation of online and offline requests. AIBrix facilitates cutting-edge research in large-scale LLM inference by offering researchers a flexible framework for exploring system-level challenges, accelerating innovation in areas that go beyond engine optimization. Ideas from popular papers, including OSDI’24 ServerlessLLM, ASPLOS’24 QLM, and Preble, have been integrated into AIBrix for benchmarking.
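To give a flavor of what "transforming large models into scalable APIs" looks like from the client side, the minimal sketch below sends a chat request to a model served behind an AIBrix gateway, assuming the deployment exposes an OpenAI-compatible endpoint (as vLLM-based serving stacks typically do). The gateway address, port, and model name are illustrative placeholders, not values from the tutorial; substitute whatever your own deployment exposes.

```python
# Minimal client-side sketch for querying a model served behind an AIBrix
# gateway, assuming an OpenAI-compatible endpoint. The base_url and model
# name below are placeholders for illustration only.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8888/v1",  # assumed gateway address (e.g., via kubectl port-forward)
    api_key="EMPTY",                      # placeholder; your gateway may enforce its own auth
)

response = client.chat.completions.create(
    model="deepseek-r1-distill-llama-8b",  # assumed name of a deployed model
    messages=[{"role": "user", "content": "Summarize AIBrix in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```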
Location & Time
- Venue: Postillion Hotel & Convention Centre WTC Rotterdam, Rotterdam, The Netherlands
- Room: Leeuwen room II
- Date & Time: Sunday, 30 March 2025, 2:00 PM to 5:30 PM
Schedule
Session 1 (2:00 PM - 3:30 PM):
- Introducing AIBrix: A Testbed for Large-Scale LLM Inference System Research
- LLM-Tailored Autoscaling: Leveraging LLM-specific Metrics and Embracing Resource Heterogeneity
- Reducing Inference Bottlenecks in Shared Prompt Environments with Prefix Caching
- KV Cache Offloading for Cross-Engine KV Reuse
- Multi-LoRA Management in Production Environments
- Open Research Challenges in LLM Inference Systems
Session 2 (4:00 PM - 5:30 PM):
- Hands-on AIBrix Feature Demo in AWS Studio Workshop
Note: All slides will be made available shortly before the tutorial.
Organizers
- Jiaxin Shan is a Software Engineer in the ByteDance Serverless Compute Infrastructure Team. He received an M.S. degree from the University of Pittsburgh. His research interests focus on ML infrastructure and serverless systems. He is a co-chair of Kubernetes WG-Serving and the Kubeflow community.
- Le Xu is a Researcher at ByteDance. She received her Ph.D. from UIUC, advised by Professor Indranil Gupta. Her research focuses on distributed systems, streaming systems, and AI systems. She has authored several publications in top-tier conferences, including NSDI, SoCC, and EuroSys.
- Haiyang Shi is a Researcher/Engineer in the ByteDance Serverless Compute Infrastructure Team. He obtained his Ph.D. from The Ohio State University, advised by Prof. Xiaoyi Lu. His research focuses on distributed systems, AI infrastructure, and high-performance interconnects and protocols.
- Gangmuk Lim is a Ph.D. student at the University of Illinois Urbana-Champaign and is currently a Research Intern at ByteDance. His research focuses on improving the performance and resilience of microservice applications and LLM inference applications in cloud-native environments. He is particularly interested in request routing that leverages application-layer knowledge.
- Ning Wang is a Research Engineer at ByteDance. He earned his Ph.D. from Temple University. His research interests include developing innovative on-device and cloud-based AI systems and applications.
- Shuowei Jin is a fifth-year Ph.D. candidate at the University of Michigan, advised by Professor Morley Mao. His research focuses on enhancing LLM inference efficiency through algorithm and systems co-design.
- Rong Kang is a Research Engineer at ByteDance. He obtained his Ph.D. from Tsinghua University. Passionate about the synergy between AI and systems, his academic and engineering interests center on AI for DB, DB for AI, and LLM serving.
- Linhui Xu is a Research Engineer at ByteDance. He holds a master’s degree from the Institute of Computing Technology, Chinese Academy of Sciences, and is interested in AI for DB and LLM acceleration.
- Premdass works as a Specialist TAM at AWS on the enterprise support team, where he helps AWS customers solve scaling problems in their Kubernetes clusters. His interests focus on cloud infrastructure, ML infrastructure, and startups.
- Liguang Xie is the Director of Engineering, Serverless Compute Infrastructure at ByteDance, where he leads the growth of compute infrastructure across North America, Europe, and Asia Pacific to support the company’s global business units. He holds a Ph.D. in Computer Engineering from Virginia Tech and is passionate about accelerating innovation in AI and machine learning. A recipient of the IEEE INFOCOM 2023 Test of Time Award, his research focuses on cloud infrastructure, cloud-native architecture, large-scale LLM inference systems, and machine learning systems and infrastructure.
Contact us
For any further questions, please reach out to us in the vLLM Slack channel or send an email to Liguang Xie or Jiaxin Shan.