ASPLOS’25 Tutorial - AIBrix: An Open Source Large-Scale LLM Inference Infrastructure for System Research
Abstract
The rapid advancements in large language model (LLM) inference have spurred numerous innovations. Our findings suggest that for LLMs to be effectively deployed in production environments, optimization at the engine level alone is insufficient. Successful production deployment demands a holistic approach that integrates optimizations across three key layers: the model layer, the engine layer, and the system layer.
AIBrix is an open-source system-level solution designed to address the complexities of LLM inference in production environments. It provides a seamless platform for transforming large models into scalable APIs, focusing on critical system aspects such as LLM-specific autoscaling strategies, model locality scheduling, cost-efficient management of heterogeneous hardware, and efficient colocation of online and offline requests. AIBrix facilitates cutting-edge research in large-scale LLM inference by offering researchers a flexible framework for exploring system-level challenges, accelerating innovation in areas that go beyond engine optimization. Ideas from several popular papers, such as OSDI’24 ServerlessLLM, ASPLOS’24 QLM, and Preble, have been integrated into AIBrix for benchmarking.
Location & Time
- Venue: Postillion Hotel & Convention Centre WTC Rotterdam, Rotterdam, The Netherlands
- Room: Leeuwen room II
- Date & Time: Sunday, March 30, 2025, Afternoon
Tentative Schedule
- AIBrix: Testbed for Public Cloud LLM Serving (45 mins)
- Routing Innovations for LLM Inference: Unlocking Lower P99 Latencies for System Efficiency (30 mins)
- Cost-Effective and QoS-Aware LLM Inference Using a Diverse Pool of Heterogeneous Instances (30 mins)
- Resource Isolation in Multi-LoRA Serving (30 mins)
- Global Traffic Router for Multi-Regional LLM Services (30 mins)
Note: All slides will be made available shortly before the tutorial.
Organizer
Jiaxin Shan is a Software Engineer in the ByteDance Infrastructure Research Lab. He received an MS degree from the University of Pittsburgh. His research interests focus on ML infrastructure and serverless systems. He is a co-chair of Kubernetes WG-Serving and the Kubeflow community.
Le Xu is a Researcher at ByteDance. She received her Ph.D. from UIUC, advised by Professor Indranil Gupta. Her research focuses on distributed systems, streaming systems, and AI systems. She has authored several publications in top-tier conferences, including NSDI, SoCC, and EuroSys.
Shuowei Jin is a fifth-year Ph.D. candidate at the University of Michigan, advised by Professor Morley Mao. His research focuses on enhancing LLM inference efficiency through algorithm and systems co-design.
Rong Kang is a Research Engineer at ByteDance. He obtained his Ph.D. from Tsinghua University. Passionate about the synergy between AI and systems, he centers his academic and engineering interests on AI for DB, DB for AI, and LLM serving.
Linhui Xu is a Research Engineer at ByteDance. He received a master's degree from the Institute of Computing Technology, Chinese Academy of Sciences. His interests include AI for DB and LLM acceleration.
Liguang Xie manages a computer systems research team at ByteDance. He received a Ph.D. in computer engineering from Virginia Tech. His research interests include optimization and algorithm design for wireless networks, as well as LLM inference systems.
Contact us
For any further questions, please contact Jiaxin Shan or connect on LinkedIn.