Since publishing, this approach to model serving has been deprecated. Find the new approach here.
When going from a notebook to a production environment, there are many considerations to take into account.
In previous posts, I’ve written a little bit about how to integrate Axon models into production environments with both native (Elixir-based) solutions and external model serving solutions.
In this post, I’m excited to introduce a new feature of Axon: Axon.Serving.
What is Axon.Serving?
Axon.Serving is a minimal model serving solution written entirely in Elixir. Axon.Serving can integrate directly into your existing applications with only a few lines of code. Axon.Serving makes minimal assumptions about your application’s needs, but implements some critical features for deploying fault-tolerant, low-latency models.
In other ecosystems, users are encouraged to use dedicated serving solutions such as TensorFlow Serving, TorchServe, and NVIDIA’s Triton Server. These are production-ready serving solutions with impressive feature sets; however, they’re mostly designed to overcome limitations in Python, the lingua franca of machine learning.
Model serving solutions introduce an additional service into your application that you must manage and query with platform-specific gRPC or HTTP APIs. Axon.Serving is a pure Elixir solution that integrates directly into your existing applications without introducing an additional service.
Additionally, because Axon builds on top of Nx, Axon.Serving is runtime-agnostic by default. That means you can take advantage of various backends and compilers, tuning for specific deployment scenarios just by changing some configuration options. Other serving solutions such as NVIDIA’s Triton Server tout similar features, allowing you to deploy TensorFlow, PyTorch, ONNX, and other models under a unified API.
However, Triton is designed specifically with server deployments in mind. On the other hand, there’s nothing preventing you from using Axon.Serving in a mobile deployment environment. It’s feasible to imagine implementing a mobile application with LiveView Native and deploying on-device models with Axon running on a CoreML backend.
Why can’t I just use Axon.predict/3?
At first glance, it might not be obvious why you would want to use the Axon.Serving API over Axon’s normal inference APIs. In some settings, such as certain batch-inference scenarios, it might make sense to continue using more straightforward inference APIs like Axon.predict/3.
However, for online-serving scenarios where model requests arrive at irregular intervals, one inference request at a time, Axon.Serving is a must for ensuring low latency.
This first release of Axon.Serving implements two performance-critical features for serving scenarios:
- Eager model compilation with fixed shapes
- Dynamic batch queue
Axon depends on Nx, which makes use of JIT compilation. When you run an Nx function (like an Axon model) for the first time, there is a slight compilation overhead. If you run the same function with inputs of different shapes or types, you incur that compilation cost again each time.
Axon.Serving works by using Nx’s eager compilation API, Nx.Defn.compile, to JIT-compile inference functions on application start-up. This means when users make requests to your application, they use the compiled function and do not incur any compilation cost.
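To make that concrete, here is a minimal sketch of Nx’s eager compilation API on its own. This illustrates Nx.Defn.compile in general, not Axon.Serving’s internals; the shapes and the EXLA compiler option are assumptions for the example:

# compile a function once for fixed input shapes
template_a = Nx.template({16, 784}, :f32)
template_b = Nx.template({784, 10}, :f32)

compiled_fn = Nx.Defn.compile(&Nx.dot/2, [template_a, template_b], compiler: EXLA)

# calls with inputs matching the template shapes reuse the compiled
# computation and pay no further compilation cost
compiled_fn.(Nx.broadcast(0.0, {16, 784}), Nx.broadcast(0.0, {784, 10}))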
In addition to compilation cost, there is a lot of overhead when switching between the Erlang runtime (ERTS) and an Nx backend or compiler’s runtime.
When using a GPU, there is even more overhead for transferring data to the GPU. In training scenarios, these overheads are offset with large batch sizes.
Using a batch size of one on modern GPUs wastes significant resources, because GPU latency is not very sensitive to batch size: a model with an average latency of 100ms at batch size one will generally exhibit roughly the same latency (up to a point) as you scale the batch size up.
Rather than servicing requests one at a time at batch size one, you should service requests in bulk to avoid bottlenecks.
Imagine a scenario where you receive requests at 20ms intervals for a model that takes 100ms to process. If you receive five requests and your application services model requests one at a time, the fifth request will not complete until 500ms after the first one arrived. If you instead batch requests, you can significantly lower the perceived latency of later requests by sacrificing some latency for earlier requests.
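Here is a quick back-of-the-envelope calculation of that tradeoff, using the illustrative numbers above (this is just arithmetic, not a benchmark):

# requests arrive every 20ms; each forward pass takes 100ms
arrivals = [0, 20, 40, 60, 80]
inference_ms = 100

# one at a time: each request waits for all previous ones to finish
sequential =
  arrivals
  |> Enum.with_index(1)
  |> Enum.map(fn {arrival, n} -> n * inference_ms - arrival end)
# => [100, 180, 260, 340, 420] perceived latencies in ms

# batched: wait for all five requests, then run a single batch of 5
batch_done = List.last(arrivals) + inference_ms
batched = Enum.map(arrivals, fn arrival -> batch_done - arrival end)
# => [180, 160, 140, 120, 100] perceived latencies in ms

The earliest request pays up to 80ms of extra latency, while the latest requests save hundreds of milliseconds.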
In order to achieve this bulk-processing effect, Axon.Serving implements a dynamic batch queue.
When configuring your model to use Axon.Serving, you specify a maximum batch size and a maximum wait time. For example, if you specify a maximum batch size of 16 and a maximum wait time of 25, your model will process requests in batches of up to 16, waiting at most 25ms for the queue to fill before executing model inference.
Rather than servicing requests eagerly, Axon.Serving delays inference until either the queue fills up or the maximum wait time elapses. To meet Nx’s static-shape requirements, Axon.Serving sacrifices some memory efficiency by padding all batches to the given maximum batch size.
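For intuition, padding a partial batch up to a fixed batch size might look something like this. This is a minimal sketch of the idea using plain Nx, not Axon.Serving’s actual implementation, and the shapes are illustrative:

# pad a partial batch along the batch axis to a fixed size
max_batch_size = 16

pad_batch = fn batch ->
  missing = max_batch_size - Nx.axis_size(batch, 0)
  # the padding config is a {pre, post, interior} tuple per axis
  pad_config = [{0, missing, 0} | List.duplicate({0, 0, 0}, Nx.rank(batch) - 1)]
  Nx.pad(batch, 0.0, pad_config)
end

# a partial batch of 5 images becomes a full batch of 16; only the first
# 5 rows of the model output would be returned to callers
partial = Nx.broadcast(0.0, {5, 3, 224, 224})
Nx.shape(pad_batch.(partial))
# => {16, 3, 224, 224}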
Despite the simplicity of these two features, Axon.Serving is capable of achieving performance competitive with other model serving frameworks.
I recently shared benchmarks of Axon.Serving versus TorchServe on the same model: Axon.Serving using EXLA, integrated with a vanilla Phoenix application, is actually more performant than TorchServe.
How do I use Axon.Serving?
Axon.Serving requires minimal changes to your application. In an existing Phoenix application, you can start an Axon.Serving instance by adding the following child spec to your supervision tree in application.ex:
# start a ResNet instance
{Axon.Serving,
 model: MyApp.Models.load_resnet(),
 name: :resnet,
 shape: {nil, 3, 224, 224},
 batch_size: 16,
 batch_timeout: 100,
 compiler: EXLA}
This will start a serving instance named :resnet. :model must be a tuple of {model, params}; in this example, the model loading code is dispatched to a separate module. :shape indicates your model’s input shape, and it must be compatible with the shape expected by the model. :batch_size and :batch_timeout represent the maximum dynamic queue size and queue timeout respectively. All other options (e.g. :compiler) are forwarded to Nx.Defn.compile.
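For reference, a hypothetical MyApp.Models module might look something like the sketch below. The module name comes from the example above, but everything inside it (the parameter file path, the helper functions, the placeholder model) is an assumption about how you might load your model:

defmodule MyApp.Models do
  # returns the {model, params} tuple that :model expects
  def load_resnet do
    {load_model(), load_params()}
  end

  defp load_model do
    # placeholder input layer matching the {nil, 3, 224, 224} shape above;
    # in a real application you would build or import your full ResNet here
    Axon.input("input", shape: {nil, 3, 224, 224})
  end

  defp load_params do
    # assumes the parameters were previously saved with Nx.serialize/2
    "priv/models/resnet.params"
    |> File.read!()
    |> Nx.deserialize()
  end
end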
Now, whenever you want to get predictions from your model, you can use Axon.Serving.predict/2:
defmodule MyAppWeb.ImageController do
  use MyAppWeb, :controller

  def predict(conn, %{"image" => image}) do
    image_tensor = normalize_input(image)
    result = Axon.Serving.predict(:resnet, image_tensor)
    normalize_and_render_output(conn, result)
  end
end
Under the hood, Axon.Serving will compile the ResNet model on application start-up. Overlapping requests to :resnet will be batched automatically.
And that’s it! Like I said, Axon.Serving is intentionally minimalistic. With a few lines of Elixir, you can replace an entire external model serving service.
If your application is already using Elixir but defers to Python and a model serving solution for machine learning, you might find it a serious quality-of-life improvement to convert your model into a format Axon can work with and use Axon.Serving directly in your Elixir application.
What Axon.Serving can’t do
Axon.Serving is intentionally minimal. Model serving solutions like TorchServe and Triton are batteries-included: they implement things like response caching, rate limiting, model management, and more out of the box.
Axon.Serving takes a less opinionated approach, allowing you to work these features into your application as you see fit. If you’re looking for something that’s batteries-included, you will probably want to consider other options.
Conclusion
Axon.Serving is an exciting new feature that makes integrating Axon into production applications seamless.
However, I’d like to emphasize that Axon.Serving is new. You might encounter edge cases and missing features. I encourage you to experiment with Axon.Serving and report any and all issues, failure cases, performance problems, etc.
Additionally, there are many features we are experimenting with and considering adding to the API. If there’s anything you think should be included, don’t hesitate to open an issue.
Until next time!