🚀 AIBrix v0.6.0 Release

Today we’re excited to announce AIBrix v0.6.0, a release that expands your options for deploying and routing inference traffic. Key highlights include:

  • Envoy Sidecar Support – Run Envoy alongside the gateway-plugin without requiring a separate Envoy Gateway controller, simplifying deployments.
  • Intelligent Mixed-Workload Routing — Run prefill/decode-optimized pods alongside standard inference pods and route requests dynamically based on workload patterns and system load.
  • Routing Profiles – Define multiple routing behaviors in a single model configuration and select them per request using a header.
  • Improved LoRA Artifact Delivery – Artifact downloads are now fully handled by the AIBrix runtime, with direct credential passing, first-class AWS S3 support, and non-blocking async downloads.
  • Expanded API Surface
    • OpenAI-compatible audio APIs: /v1/audio/transcriptions, /v1/audio/translations
    • New endpoints: /v1/classify and /v1/rerank

Together, these updates make AIBrix v0.6.0 easier to deploy, easier to observe, and more adaptable for production AI workloads. For the complete list of changes, commit history, and contributor details, see the AIBrix v0.6.0 Release Notes.

v0.6.0 Highlight Features

Envoy as a Sidecar: Simplifying Gateway Deployments

This release introduces support for running Envoy as a sidecar alongside the AIBrix gateway-plugin. Instead of relying on an external Envoy Gateway controller, operators can now embed Envoy directly within the same pod as the gateway plugin. This approach provides a lighter-weight deployment option and reduces the architectural complexity of the gateway stack.

The new mode is controlled through the envoyAsSideCar flag in the Helm chart. When enabled, Envoy runs as a sidecar container that shares the lifecycle of the gateway-plugin pod. This removes the hard dependency on Envoy Gateway while giving operators more direct control over Envoy configuration and behavior.
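For example, a values fragment enabling the new mode might look like the following. The envoyAsSideCar flag is the documented switch; exactly where it sits in the chart's values hierarchy may differ, so treat this as a sketch rather than the chart's actual schema:

```yaml
# Illustrative values.yaml fragment. Only the envoyAsSideCar flag is
# documented; its exact location in the chart's values may differ.
envoyAsSideCar: true
```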

Flexible Deployment Modes

With this change, AIBrix now supports two mutually exclusive deployment patterns, allowing teams to choose the model that best fits their infrastructure and operational preferences.

+---------------------------+        +-----------------------------+
|     Envoy Gateway Mode    |        |     Envoy Sidecar Mode      |
|     (current default)     |        |        (new option)         |
|                           |        |                             |
|  Envoy Gateway Controller |        |  AIBrix Gateway Plugin Pod  |
|        + HTTPRoute        |        |   + Envoy Sidecar Container |
+---------------------------+        +-----------------------------+

In Envoy Gateway mode, Envoy is managed through a separate control-plane component that uses the Kubernetes Gateway API. Resources such as GatewayClass, EnvoyExtensionPolicy, and HTTPRoute define how traffic is routed and processed, following a controller-driven architecture.

In contrast, Envoy Sidecar mode runs Envoy directly within the gateway-plugin pod. Envoy receives its configuration from a ConfigMap and is exposed through the gateway-plugin service, eliminating the need for Gateway API controllers. This model simplifies networking and reduces the number of required cluster resources.


Smarter Request Routing for Mixed LLM Workloads

Modern LLM inference workloads are rarely uniform. Some requests contain long prompts that benefit from specialized execution pipelines, while others are short and interactive. Designing infrastructure that efficiently handles both types of requests can be challenging.

In the latest AIBrix release, we introduce a new routing capability that allows the gateway to intelligently route requests across different types of inference pods within the same deployment. This enables AIBrix to dynamically choose between prefill/decode–optimized pods and standard inference pods, improving overall performance, flexibility, and resource utilization.

Instead of forcing operators to choose one architecture or maintain separate deployments, AIBrix can now run both approaches together and automatically decide which pod should handle each request.

Intelligent Routing Across Pod Types

With this new routing strategy, AIBrix can direct requests to the most appropriate execution path based on workload characteristics and real-time system conditions.

Each pod type serves a different role:

  • Prefill/Decode Disaggregated Pods (PD Pods) — Designed for workloads where separating prefill and decode stages improves efficiency. These pods are particularly effective for long prompts or workloads dominated by heavy prompt processing.

  • Standard Inference Pods — Execute the entire request lifecycle within a single process. These pods are well suited for short prompts and interactive requests, and can also absorb traffic when PD resources are busy.

The AIBrix gateway continuously evaluates system conditions and routes each request to the best available pod.
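As a rough illustration of that decision, the choice between pod types can be sketched as below. The threshold value and all names here are assumptions for illustration, not AIBrix internals:

```python
# Illustrative sketch of the pod-type decision described above.
# The threshold and function names are hypothetical, not AIBrix code.
PD_PROMPT_THRESHOLD = 2048  # tokens; placeholder cutoff for "long" prompts

def choose_pod_type(prompt_tokens: int, pd_capacity_free: bool) -> str:
    """Return which pod type should serve a request."""
    if prompt_tokens >= PD_PROMPT_THRESHOLD and pd_capacity_free:
        # Long, prefill-heavy requests go to disaggregated PD pods.
        return "prefill-decode"
    # Short or interactive requests, and overflow traffic when PD
    # capacity is saturated, go to standard inference pods.
    return "standard"
```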

How Routing Works

At the core of this capability is an updated routing algorithm that evaluates available pods and dynamically selects the best candidate.

                      +---------------------------+
                      |         Client            |
                      +-------------+-------------+
                                    |
                                    ▼
                        Routing Algorithm (Gateway)
                                    |
          +-------------------------+------------------------+
          |                                                  |
          ▼                                                  ▼
 +------------------------+                    +------------------------+
 |  Prefill/Decode Pods   |                    | Standard Inference Pods|
 | (Disaggregated Stages) |                    | (Single Execution Path)|
 +------------------------+                    +------------------------+
         ▲                                                  ▲
         |  Selected for long prompts or                    |  Selected for short prompts
         |  prefill-heavy workloads                         |  or when PD capacity is busy

The routing decision incorporates several signals, including:

  • Current pod load
  • Queue depth
  • Pod availability
  • Scoring logic used to rank candidate pods

This scoring system allows AIBrix to distribute traffic efficiently while maintaining stable latency.
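A minimal sketch of this kind of scoring, assuming hypothetical field names and placeholder weights (AIBrix's actual scorer is more involved), might look like:

```python
# Hypothetical sketch of ranking candidate pods by load, queue depth,
# and availability. Field names and weights are illustrative only.
from dataclasses import dataclass

@dataclass
class PodStats:
    name: str
    load: float       # normalized load (0.0 = idle, 1.0 = saturated)
    queue_depth: int  # requests currently waiting on this pod
    available: bool   # pod passes health checks

def score(pod: PodStats) -> float:
    """Lower scores are better candidates."""
    if not pod.available:
        return float("inf")
    # Weight load and queue depth; these weights are placeholders.
    return 0.7 * pod.load + 0.3 * (pod.queue_depth / 10)

def pick_pod(pods: list[PodStats]) -> PodStats:
    """Select the best-ranked available pod."""
    return min(pods, key=score)
```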

Benefits

Supporting both execution models within the same deployment allows AIBrix to adapt dynamically to changing workloads.

Key benefits include:

  • Optimized handling of mixed workloads — Long prompts are routed to prefill/decode pods, while short requests are handled efficiently by standard inference pods.
  • Graceful handling of traffic spikes — Standard inference pods can absorb overflow traffic when PD resources are saturated.
  • Single deployment architecture — Run multiple execution models for the same model without managing separate clusters.
  • Dynamic routing decisions — Traffic is distributed based on real-time system conditions instead of static configuration.
  • Improved GPU utilization — Requests are balanced across available pods to maximize throughput and efficiency.

Routing Profiles: One Deployment, Multiple Routing Behaviors

As inference workloads grow more diverse, different types of traffic often require different routing strategies. Some requests benefit from PD (prefill/decode) routing, others prioritize low latency, while general workloads may only need simple load balancing. Traditionally, supporting these variations required multiple deployments or complex label configurations.

With this release, AIBrix introduces Routing Profiles — a new way to define multiple routing behaviors within a single model configuration and select them dynamically per request.

Instead of spreading routing settings across pod labels or maintaining separate deployments, you can define multiple named routing profiles inside the model.aibrix.ai/config annotation. Clients then select the desired routing behavior at request time using the config-profile header.

This allows a single model deployment to handle multiple traffic patterns — such as general workloads, PD routing, or low-latency requests — all using the same set of pods.

Defining Routing Profiles

Routing profiles are defined as structured JSON inside the model.aibrix.ai/config annotation. Each profile can configure routing behavior and parameters such as:

  • routingStrategy (e.g. random, pd, or least-latency)
  • Prompt-length bucket ranges used by PD routing (promptLenBucketMinLength, promptLenBucketMaxLength)
  • Whether pods operate in combined prefill/decode mode

You also define a defaultProfile, which acts as a fallback when a client does not specify a profile. These are the currently supported options; additional profile settings will be added over time.

Example:

{
  "defaultProfile": "pd",
  "profiles": {
    "default": {
      "routingStrategy": "random",
      "promptLenBucketMinLength": 0,
      "promptLenBucketMaxLength": 4096
    },
    "pd": {
      "routingStrategy": "pd",
      "promptLenBucketMinLength": 0,
      "promptLenBucketMaxLength": 2048
    },
    "low-latency": {
      "routingStrategy": "least-latency",
      "promptLenBucketMinLength": 0,
      "promptLenBucketMaxLength": 2048
    }
  }
}

In this example, the pd profile is configured as the default. Clients can explicitly choose default, pd, or low-latency depending on their workload. If no profile is provided, AIBrix automatically falls back to the default profile.
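The selection logic can be sketched as follows. The annotation payload mirrors the example above, while resolve_profile is a hypothetical helper, not the actual AIBrix gateway code:

```python
import json

# Hypothetical sketch of per-request profile selection from the
# model.aibrix.ai/config annotation and the config-profile header.
ANNOTATION = json.loads("""
{
  "defaultProfile": "pd",
  "profiles": {
    "default":     {"routingStrategy": "random"},
    "pd":          {"routingStrategy": "pd"},
    "low-latency": {"routingStrategy": "least-latency"}
  }
}
""")

def resolve_profile(config: dict, headers: dict) -> dict:
    """Pick a routing profile based on the config-profile header."""
    requested = headers.get("config-profile")
    profiles = config["profiles"]
    if requested in profiles:
        return profiles[requested]
    # Header absent or unknown: fall back to defaultProfile.
    return profiles[config["defaultProfile"]]
```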

Why Routing Profiles Matter

Routing Profiles simplify routing configuration while making deployments far more flexible. Instead of creating separate deployments for different traffic patterns, operators can define multiple routing behaviors in a single configuration and select them dynamically.

This approach provides several benefits:

  • Single source of truth for routing configuration
  • Per-request flexibility without extra deployments
  • Cleaner gateway logic that is easier to manage
  • Clear separation between workload types (batch vs. interactive, PD vs. single-pod)

In practice, this means one deployment can support many routing behaviors, allowing AIBrix to adapt to different workload patterns without increasing operational complexity.


Streamlined LoRA Artifact Delivery with AIBrix Runtime

Managing LoRA adapters often requires coordinating credentials, storage access, and runtime behavior. To simplify this, AIBrix now moves LoRA artifact preparation and delivery entirely into the runtime, making it the single source of truth for downloading and preparing model artifacts.

This change simplifies credential handling, centralizes artifact management, and keeps the runtime responsive during downloads.

Runtime as the Source of Truth

Previously, artifact preparation involved coordination between controllers and the runtime. Now the AIBrix runtime handles artifact validation and downloads directly, while the controller simply passes the required information such as credentials.

This clearer separation reduces operational complexity and simplifies the LoRA adapter lifecycle.

Simplified Credential Flow

The modeladapter controller now retrieves credentials directly from the referenced Kubernetes Secret and embeds them into the LoRA load request.

The flow is straightforward:

  • The controller reads the referenced Kubernetes Secret.
  • It converts secret.Data into a key-value map.
  • It sends the credentials directly to the runtime.

This makes it easier to support IAM-style credentials for S3-compatible storage systems and removes ambiguity around artifact access configuration.
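The transformation itself is small. Here is a Python sketch of the equivalent step; the actual controller is written in Go, and the key names and values below are placeholders:

```python
import base64

# Sketch of the controller-side step: a Kubernetes Secret's data field
# holds base64-encoded values, which are decoded into the flat
# key-value credential map sent in the LoRA load request.
secret_data = {
    "AWS_ACCESS_KEY_ID": base64.b64encode(b"example-key-id").decode(),
    "AWS_SECRET_ACCESS_KEY": base64.b64encode(b"example-secret").decode(),
}

credentials = {
    key: base64.b64decode(value).decode()
    for key, value in secret_data.items()
}
```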

Easy Configuration via artifactURL + Secret Reference

In practice, you can now point the adapter to a specific LoRA artifact version by providing an S3 URL and a Kubernetes Secret reference. For example:

artifactURL: "s3://<YOUR_S3_URI>"
credentialsSecretRef:
  name: <YOUR_AWS_CREDENTIAL_SECRET_NAME>

With this setup, the runtime can automatically fetch and prepare the exact LoRA artifacts from object storage, making it effortless to pin, switch, and roll out a desired adapter version.

Non-Blocking Artifact Downloads

Since object storage SDKs often rely on blocking I/O, artifact downloads are executed using:

asyncio.to_thread

This runs downloads in a worker thread, keeping the async runtime responsive and allowing concurrent requests to continue while artifacts are being fetched.
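The pattern looks roughly like this; blocking_download is purely illustrative and stands in for a real object-storage SDK call (for example, boto3's download_file), which uses blocking I/O:

```python
import asyncio
import time

def blocking_download(url: str) -> str:
    """Placeholder for a blocking SDK download call."""
    time.sleep(0.05)  # simulate a slow network transfer
    return f"/local/cache/{url.rsplit('/', 1)[-1]}"

async def fetch_artifact(url: str) -> str:
    # Offload the blocking call to a worker thread so the event loop
    # stays free to serve concurrent requests while the download runs.
    return await asyncio.to_thread(blocking_download, url)

path = asyncio.run(fetch_artifact("s3://bucket/adapters/lora-v1"))
```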

A Simpler, More Reliable Pipeline

With these updates, LoRA artifact delivery becomes more streamlined. The runtime now manages artifact preparation centrally, credentials flow more cleanly, and downloads no longer block the event loop—making LoRA adapter management more reliable and production-ready.


New API Endpoints and Custom Routing Support

We’re excited to introduce several new API endpoints, along with enhanced flexibility for custom routing paths in your deployments.

New Endpoints

The following endpoints are now available:

  • /v1/audio/transcriptions – Convert audio to text with high-accuracy speech recognition.
  • /v1/audio/translations – Transcribe and translate audio into your target language.
  • /v1/classify – Perform text classification tasks with optimized model inference.
  • /v1/rerank – Improve retrieval quality by reranking candidate results based on relevance.

These additions expand support for speech processing, content classification, and retrieval optimization workflows.


Other Improvements

v0.6.0 also expands observability, deployment options, and stability:

  • Observability & metrics: Gateway metrics collected directly from the gateway layer; granular inference request metrics; new SGLang gateway metrics dashboard; Prometheus auth via Kubernetes secrets and query queueing.
  • Deployment & installation: Gateway plugin can run in standalone mode without Kubernetes; simplified Docker Compose installation for local and dev setups.
  • StormService & control plane: Per-role revision tracking; role-level status aggregation; periodic reconciliation for ModelAdapter; dynamic discovery provider updates; improved PodGroup and scheduling strategy handling.
  • KVCache framework: Block-first layout support; padding token support in CUDA kernels; aibrix_pd_reuse_connector for combined PD reuse workflows.
  • Gateway & routing: Session affinity routing; external header filters for advanced routing; least-request routing for distributed DP API servers.
  • Bug fixes: RDMA issues in P/D setups (SGLang/vLLM); Redis auth in Helm; divide-by-zero in APA autoscaling; envoy extension policy paths and gateway service configuration; metrics label cardinality panic; CGO build alignment for builder/runtime env.

Contributors & Community

This v0.6.0 release includes 95 merged PRs, with 15 from first-time contributors 💫. Thank you to everyone who helped shape this release through code, issues, reviews, and feedback.

Special shout-out to @Jeffwan, @varungup90, @googs1025, @scarlet25151, @DwyaneShi, and @nurali-techie for their continued work on reliability, gateway improvements, control-plane evolution, and documentation.

We’re excited to welcome the following new contributors to the AIBrix community:

@sherlockkenan, @dczhu, @sceneryback, @Deepam02, @rayne-Li, @n0gu-furiosa, @cabrinha, @sanmuny, @erictanjn, @fungaren, @paranoidRick, @pbillaut, @alpe, @liangdong1201, @yahavb 🙌

Your contributions continue to make AIBrix more scalable, production-ready, and welcoming as an open community. We’re excited to see the ecosystem grow—keep them coming!

Next Steps

If you’re running LLMs in production or exploring architectures around serverless, KV cache, or P/D disaggregation, we’d love your feedback and collaboration. Check out the v0.7.0 roadmap, join the discussion, and contribute on GitHub.