🚀 AIBrix v0.5.0 Release
Today, we’re excited to announce AIBrix v0.5.0, a release that pushes AIBrix closer to a batteries-included control plane for modern LLM workloads. This release introduces an OpenAI-compatible Batch API for high-volume, latency-insensitive workloads, making it easy to offload large offline and evaluation jobs without overloading real-time endpoints. We also introduce a new KVCache connector, AIBrixOffloadingConnectorV1Type3, which enables pipelined KVCache prefetching and loading as well as layer-wise KVCache offloading for more efficient cache management. Finally, v0.5.0 turns StormService into a production-grade control plane for P/D disaggregation, with PodSet/PodGroup primitives for multi-pod management, topology- and load-aware P/D routing, and role-level autoscaling via subTargetSelector for fine-grained prefill/decode scaling.

v0.5.0 Highlight Features
Batch API
We are excited to announce the release of the AIBrix Batch API, a powerful new feature designed to optimize high-volume, latency-insensitive inference workloads.
As GenAI applications scale, not every request requires an immediate response. Tasks such as large-scale dataset evaluation, offline content generation, and bulk data processing often clog up real-time serving endpoints, leading to inefficient resource utilization and higher costs.
The AIBrix Batch API addresses this by allowing users to submit large volumes of requests asynchronously. By processing these requests in optimally sized batches, AIBrix can significantly increase GPU saturation and overall cluster throughput compared to standard online serving.
Key Features
- OpenAI Compatibility: Built to be a drop-in replacement for existing workflows, supporting the standard OpenAI Batch API format (e.g., /v1/batches).
- Asynchronous Processing: “Fire and forget” architecture. Submit massive jobs via .jsonl files and retrieve results when ready, freeing up your client applications.
- Configurable Job Pools: Fine-tune your resource allocation with configurable job pool sizes to match your specific hardware constraints and throughput goals.
- Enhanced Error Handling: Robust validation and error reporting support auto-retrial of requests, ensuring you can track the status of every individual request within a massive batch.
Quick Start
Because AIBrix is OpenAI-compatible, getting started is straightforward for anyone familiar with standard LLM tooling.
- Prepare your batch file (requests.jsonl):
{"custom_id": "req-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "llama-3-8b", "messages": [{"role": "user", "content": "Hello world!"}]}}
{"custom_id": "req-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "llama-3-8b", "messages": [{"role": "user", "content": "Explain batch processing."}]}}
- Submit the batch (using standard OpenAI client pointed to AIBrix):
from openai import OpenAI

client = OpenAI(
    base_url="http://your-aibrix-endpoint/v1",
    api_key="aibrix"
)

# Upload file
batch_file = client.files.create(
    file=open("requests.jsonl", "rb"),
    purpose="batch"
)

# Create batch job
batch_job = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)

print(f"Batch submitted: {batch_job.id}")
Learn More
For full deploy configurations and API references, please visit our official documentation.
KVCache v1 Connector
Our benchmarking and real-world internal deployments with the AIBrix KVCache connector released in v0.4.0 revealed significant opportunities to further improve performance. To capitalize on them, we enabled several key optimizations and introduced a new KVCache connector (i.e., AIBrixOffloadingConnectorV1Type3).
Key Features
- Pipelined KVCache Prefetching and Loading: This optimization allows the three critical stages (prefetch, load, and compute) to overlap and run simultaneously, dramatically reducing idle time, eliminating latency penalties on TPOT, and unlocking new levels of throughput performance (see the sketch after this list).
- Layer-wise KVCache Offloading: This feature hides the latency of KVCache transfer by performing offloading concurrently with each layer’s forward pass, enabling efficient inference even at low cache hit ratios. This ensures the computational engine is almost always busy and resource utilization is maximized.
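To illustrate the idea behind the pipelined design (a simplified two-stage sketch of I/O and compute, not the actual connector implementation), the toy Python example below overlaps prefetching the KVCache blocks for the next layer with computing the current layer; all function names and data shapes are illustrative only.

from concurrent.futures import ThreadPoolExecutor

def prefetch_kv(layer_idx):
    """Illustrative stand-in: fetch offloaded KV blocks for one layer."""
    return f"kv-blocks-for-layer-{layer_idx}"

def compute_layer(layer_idx, kv_blocks):
    """Illustrative stand-in: run the forward pass for one layer."""
    return f"output-of-layer-{layer_idx}"

def pipelined_forward(num_layers):
    # Overlap the prefetch of layer i+1 with the compute of layer i,
    # so the compute stream rarely waits on KVCache transfers.
    with ThreadPoolExecutor(max_workers=1) as io_pool:
        next_kv = io_pool.submit(prefetch_kv, 0)
        for i in range(num_layers):
            kv_blocks = next_kv.result()  # waits only if prefetch is slower than compute
            if i + 1 < num_layers:
                next_kv = io_pool.submit(prefetch_kv, i + 1)  # start fetching the next layer's KV
            compute_layer(i, kv_blocks)  # compute overlaps with the in-flight prefetch

pipelined_forward(num_layers=4)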
The results from our benchmarks are compelling. Compared with the legacy AIBrix KVCache connector (i.e., AIBrixOffloadingConnectorV1Type1), for models such as Llama 3.1 70B with Tensor Parallelism (TP)=8, these combined optimizations deliver over a 20% improvement in both TPOT and overall throughput, while still maintaining efficient TTFT.
Production-Grade Prefill/Decode (P/D) Orchestration
v0.5.0 turns StormService into a full-fledged control plane for large, disaggregated P/D clusters. A new PodGroup API allows StormService to integrate with coscheduling ecosystems (Coscheduling, Godel, Volcano), so tightly coupled workers are treated as a single schedulable unit. Together with the new PodSet API, StormService can now manage multi-pod workers and shard groups explicitly—controlling their lifecycle, topology, and health as one logical entity while remaining backward compatible with existing single-pod setups.
On top of that, v0.5.0 introduces stronger rollout and restart semantics for complex deployments. The new FullRecreate strategy gives operators an atomic way to recover unhealthy PodSets without leaking partial state, while role upgrade sequences let you roll out changes in a safe, ordered fashion across roles (e.g., in-cluster router → prefill → decode) instead of in arbitrary order. This combination makes high-risk operations (schema changes, routing changes, runtime bumps) far more predictable.
spec:
  roles:
    - name: prefill
      replicas: 3
      podGroupSize: 2  # new field indicating the pod group size, e.g., for DP or TP deployments
      stateful: true
      recoveryPolicy: ReplaceUnhealthy  # ReplaceUnhealthy or Recreate
      template:
        ...
The P/D routing layer is upgraded to understand these orchestration primitives. In replication and pooled modes, AIBrix now prefers pairing prefill and decode workers from the same RoleSet/PodSet, uses load-aware scoring to select the least-busy candidates, and aligns Nixl-based P/D disaggregation with the correct kv_transfer_params so traffic lands on the right group with the right cache state. Additional safeguards ensure routing logic respects HttpRoute status and failure conditions, closing correctness gaps seen in earlier releases.
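As a rough illustration of this affinity-plus-load scoring (a simplified sketch, not the gateway’s actual routing code), the snippet below scores decode candidates for a given prefill worker, preferring candidates from the same PodSet and breaking ties by current load; all names, fields, and weights are hypothetical.

from dataclasses import dataclass

@dataclass
class Worker:
    name: str
    podset: str            # PodSet / RoleSet the worker belongs to
    active_requests: int   # simple load signal; a real router may use richer metrics

def score(prefill: Worker, decode: Worker, affinity_bonus: float = 100.0) -> float:
    """Higher is better: same-PodSet pairs win first, then the least-busy candidate."""
    s = -float(decode.active_requests)
    if decode.podset == prefill.podset:
        s += affinity_bonus
    return s

def pick_decode(prefill: Worker, candidates: list[Worker]) -> Worker:
    return max(candidates, key=lambda d: score(prefill, d))

prefill = Worker("prefill-0", podset="podset-a", active_requests=3)
candidates = [
    Worker("decode-0", podset="podset-a", active_requests=7),
    Worker("decode-1", podset="podset-b", active_requests=1),
]
print(pick_decode(prefill, candidates).name)  # picks decode-0: same PodSet outweighs its higher load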
Finally, we’ve added role-level autoscaling for StormService so prefill and decode can scale independently based on their own signals. Using the new subTargetSelector in PodAutoscaler, operators can attach distinct autoscalers to different roles or pools (for example, aggressive scaling for prefill, conservative for decode), which is essential for P/D pooling scenarios and heterogeneous clusters. This makes P/D disaggregation not just possible, but operationally clean at scale.
# PodAutoscaler for prefill role
apiVersion: autoscaling.aibrix.ai/v1alpha1
kind: PodAutoscaler
metadata:
name: ss-pool-prefill
namespace: default
annotations:
autoscaling.aibrix.ai/storm-service-mode: "pool"
spec:
scaleTargetRef:
apiVersion: orchestration.aibrix.ai/v1alpha1
kind: StormService
name: ss-pool
# new Added: Select the prefill role within the StormService
subTargetSelector:
roleName: prefill
Full example: https://github.com/vllm-project/aibrix/blob/main/samples/autoscaling/stormservice-pool.yaml
Other Improvements
v0.5.0 also strengthens the runtime layer so operators get a cleaner, more consistent control path across all engines. The metadata server is migrated from Go to Python, wired with liveness/readiness probes, and shipped with a slimmer image footprint, while the downloader now handles recursive object storage layouts more robustly—making it easier to standardize model and artifact management in real clusters. Webhooks and lightweight wrapper libraries auto-inject the AIBrixRuntime sidecar into Deployments and StormService workloads, so metrics, downloads, and admin operations are unified without bespoke per-engine glue.
On top of this, we’ve hardened LoRA and Model Adapter workflows for real multi-tenant usage. AIBrix now supports scaling adapters to desired replicas, refactors how adapter replicas are tracked, adds typed wrappers for easier integration, and allows LoRA artifacts to be pulled directly via the runtime. Together, these changes make running many LoRAs per base model, across multiple engines and clusters, far less fragile and much easier to automate.
The autoscaling stack gets a similar level-up. v0.5.0 unifies metrics fetching with a retryable RestMetricsFetcher, a shared client/aggregator, and race-condition fixes in configuration updates, so scaling decisions are both faster and more reliable. We tune KPA defaults, add metric label selector support, only emit events when replica counts actually change, and expose scaling history directly in the PodAutoscaler status. This turns autoscaling into an observable, debuggable, and policy-driven component instead of a black box.
Taken together, these runtime, LoRA, injection, and autoscaling improvements push AIBrix closer to a batteries-included control plane: one runtime sidecar, one metrics and scaling story, one adapter management path—reused consistently across vLLM, SGLang, batch workers, and future engines.
For a full list of changes, commit history, and contributor details, please check out the AIBrix v0.5.0 Release Notes.
Contributors & Community
This v0.5.0 release includes 39 contributions, with 21 of them coming from first-time contributors 💫. A huge thank-you to everyone who helped shape this release through code, issues, reviews, and feedback.
Special shout-out to our new contributors below and @googs1025 and @omerap12 for driving our control-plane hardening and reliability efforts.
@JonathonShea, @bigerous, @jiangxiaobin96, @mayooot, @zyfy29, @zhengkezhou1, @tianzhiqiang3, @atakli, @jwjwjw3, @lx1036, @chethanuk, @baozixiaoxixi, @TylerGillson, @omrishiv, @lex1ng, @ChenTaoyu-SJTU, @zhenyu-02, @yapple, @xvoron, @freedown19, @Leafykn 🙌
Your contributions help make AIBrix more robust, production-ready, and welcoming as an open community. Keep them coming!
Next Steps
We’re continuing to push AIBrix toward a fully production-grade, cloud-native stack for modern LLM workloads. For v0.6.0, we’re focusing on a few pillars: Production-Grade LoRA & Serverless LLM Services; KV-Cache-Centric P/D & Offloading Workflows; Fault Tolerance & Resilience Exploration; and Ecosystem Integrations (vLLM Semantic Router & Envoy AI Gateway), which cover areas intentionally kept out of AIBrix’s core while keeping the stack composable.
If you’re running LLMs in production or exploring new architectures around Serverless, KV cache, or P/D disaggregation, we’d love your feedback and collaboration on the v0.6.0 roadmap—join the discussion and contribute on GitHub.