Accelerating Decision Trees with Mockingjay

Introduction

In my last post, I introduced EXGBoost, a library for training and using gradient boosting models, powered by the popular XGBoost library.

Gradient boosting is a powerful machine-learning technique that relies on an ensemble of decision trees to make predictions about data. It is one of the few methods that still outperforms deep learning on certain classes of data. Specifically, gradient boosting is the methodology of choice for structured data. Structured data often represents a majority of business data, which means gradient boosting models are ubiquitous in the world of machine learning.

EXGBoost works as a NIF wrapper around the XGBoost C API, rather than as a native Nx or Elixir implementation. One of the early design decisions of Nx was to avoid building an ecosystem that over-relies on NIFs. The Nx backend and compiler infrastructure is designed to be an extensible foundation on which third-party libraries can build without locking themselves into the implementation details of specific machine-learning ecosystems.

So, why not do the same for gradient boosting? The main reason is that it is just not easy to represent the training and prediction algorithms used in XGBoost using native Nx. While it is certainly possible to write some of these routines in pure Elixir, the speed is just not up to snuff. Relying on a NIF for EXGBoost was the easiest path to success.

Of course, the NIF implementation has some drawbacks. How can you use the NIF in a straightforward manner in an Nx.Serving? How can you use a trained EXGBoost model in a numerical definition? The short answer was that you just couldn’t, until now.

Introducing Mockingjay

Following up on his debut Elixir library, Andres Alejos created Mockingjay, a library that implements algorithms from Microsoft’s Hummingbird for compiling tree-based machine learning models into native Nx functions. That means you can take a trained EXGBoost model, transform it into a native numerical definition, and JIT-compile it or wrap it in an Nx.Serving just as you would any other Nx numerical definition. After training your EXGBoost model, you can compile it and forget about the NIF.
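
To give a feel for what "compiling a tree into tensor operations" can mean, here is a minimal sketch of the GEMM strategy described in the Hummingbird paper, written by hand for one tiny tree. It is an illustration only, not Mockingjay's actual code: the module name GemmTreeSketch, the tree itself, and the matrices a through e are all made up for this example, and the Nx version in Mix.install is an assumption.

```elixir
# Standalone script; assumes Nx can be fetched with Mix.install.
Mix.install([{:nx, "~> 0.7"}])

defmodule GemmTreeSketch do
  import Nx.Defn

  # Hand-written version of Hummingbird's GEMM strategy for a single tree.
  defn predict(x, a, b, c, d, e) do
    x
    |> Nx.dot(a)            # pick the feature each internal node tests
    |> Nx.less(b)           # 1 where the node's "feature < threshold" test passes
    |> Nx.dot(c)            # score every leaf against the node decisions
    |> Nx.equal(d)          # exactly one leaf matches its expected score
    |> Nx.dot(e)            # map the selected leaf to a one-hot class row
    |> Nx.argmax(axis: -1)  # class index per input row
  end
end

# Illustrative tree (not from a trained model):
#   node0: x[0] < 0.5 ? node1 : leaf2
#   node1: x[1] < 2.0 ? leaf0 : leaf1
a = Nx.tensor([[1, 0], [0, 1]])         # features x internal nodes
b = Nx.tensor([0.5, 2.0])               # threshold per internal node
c = Nx.tensor([[1, 1, -1], [1, -1, 0]]) # +1 leaf in left subtree, -1 right, 0 unrelated
d = Nx.tensor([2, 1, 0])                # left turns on the path to each leaf
e = Nx.tensor([[1, 0], [0, 1], [1, 0]]) # leaf -> class (leaf0: 0, leaf1: 1, leaf2: 0)

x = Nx.tensor([[0.2, 1.0], [0.2, 3.0], [0.9, 1.0]])
preds = GemmTreeSketch.predict(x, a, b, c, d, e)
```

The three rows land in leaf0, leaf1, and leaf2 respectively. Because the whole prediction is matrix multiplications and comparisons, it can be JIT-compiled and run on whatever backend Nx is configured with.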

I really want to emphasize how awesome this is. After training your EXGBoost model, you can transform your model into a format that works with all of your other Nx code. If you have an existing serving pipeline, you can just wrap your compiled model in a serving and use that instead. Everything just works.
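
As a rough sketch of what "wrap your compiled model in a serving" could look like, the snippet below builds an Nx.Serving around a prediction function. To keep it self-contained, predict here is a hand-written stand-in (a single threshold on the first feature), not the output of EXGBoost.compile; the Nx version is also an assumption.

```elixir
# Standalone script; assumes Nx can be fetched with Mix.install.
Mix.install([{:nx, "~> 0.7"}])

# Stand-in for a compiled model: predicts class 1 when feature 0 > 0.5.
# In the post's workflow this would be the function returned by EXGBoost.compile.
predict = fn x ->
  x
  |> Nx.slice_along_axis(0, 1, axis: 1)
  |> Nx.greater(0.5)
  |> Nx.squeeze(axes: [1])
end

# The serving receives defn options and returns a JIT-compiled function.
# Compiler options (for example compiler: EXLA, if EXLA is installed) could be
# passed as a second argument to Nx.Serving.new.
serving = Nx.Serving.new(fn opts -> Nx.Defn.jit(predict, opts) end)

# Servings operate on batches; here one tensor holding two rows.
batch = Nx.Batch.concatenate([Nx.tensor([[0.2, 1.0], [0.9, 3.0]])])
preds = Nx.Serving.run(serving, batch)
```

The same serving could also be started under a supervision tree and called with Nx.Serving.batched_run, which is where the batching behavior pays off.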

Even cooler: Mockingjay is extensible and is designed to interface with other tree-based machine learning methods. There’s no reason it needs to just work with EXGBoost. For example, if you messed around with implementing a decision tree library using pure Elixir, you could implement the Mockingjay protocol and instantly have a pipeline to turn your pure Elixir implementation into a blazing-fast Nx implementation.
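
The general shape of that idea can be sketched with a plain Elixir protocol. Everything below is hypothetical and exists only for illustration: TreeSourceSketch is not Mockingjay's protocol, and its function names and the nested-map tree format are assumptions, so check Mockingjay's documentation for the callbacks it actually requires.

```elixir
# Hypothetical protocol, named for this sketch only. Not Mockingjay's API.
defprotocol TreeSourceSketch do
  @doc "Returns the ensemble's trees as nested maps."
  def trees(model)

  @doc "Returns the number of input features the model expects."
  def num_features(model)

  @doc "Returns the number of output classes."
  def num_classes(model)
end

# A toy pure-Elixir model: one split on one feature.
defmodule MyStump do
  defstruct [:feature, :threshold, :left_class, :right_class, :num_features]
end

# Implementing the protocol is all a model needs to do to become
# consumable by code written against the protocol.
defimpl TreeSourceSketch, for: MyStump do
  def trees(stump) do
    [
      %{
        feature: stump.feature,
        threshold: stump.threshold,
        left: %{value: stump.left_class},
        right: %{value: stump.right_class}
      }
    ]
  end

  def num_features(stump), do: stump.num_features
  def num_classes(_stump), do: 2
end

stump = %MyStump{feature: 0, threshold: 0.5, left_class: 0, right_class: 1, num_features: 4}
[tree] = TreeSourceSketch.trees(stump)
```

A converter written against such a protocol never needs to know which library produced the trees, which is what makes this kind of design extensible.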

Thanks to Nx’s backend and compiler infrastructure, compiled models can interface directly with accelerators such as GPUs or TPUs. Several initiatives exist in the Python ecosystem to bring decision trees to the GPU. XGBoost has CUDA bindings for this as well. Fortunately, we don’t have to mess with them at all. We can just compile our models to Nx and everything works.

And, if all of this ecosystem unification isn’t enough, early benchmarks show that Mockingjay-compiled decision trees run up to 70% faster on the CPU with Nx and EXLA than raw EXGBoost. Not only do you get a massive simplification of your machine learning stack, but you also get a significant performance boost.

Using Mockingjay

Getting started with Mockingjay is straightforward. EXGBoost abstracts away most of the work of converting models with Mockingjay. First, you need a trained model. To get there, we’ll start with some data:

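# Assumes the :scidata, :nx, and :exgboost packages are installed.
# Download the Iris dataset as lists of features and labels.
{x, y} = Scidata.Iris.download()
# Shuffle the rows, then hold out 20% of them for testing.
data = Enum.zip(x, y) |> Enum.shuffle()
{train, test} = Enum.split(data, ceil(length(data) * 0.8))
{x_train, y_train} = Enum.unzip(train)
{x_test, y_test} = Enum.unzip(test)

# Convert the lists to tensors.
x_train = Nx.tensor(x_train)
y_train = Nx.tensor(y_train)

x_test = Nx.tensor(x_test)
y_test = Nx.tensor(y_test)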

Next, train a simple booster:

model = EXGBoost.train(x_train, y_train, num_class: 3, objective: :multi_softprob)

Then, call EXGBoost.compile with your trained model:

predict = EXGBoost.compile(model)

And that’s it!

You’ll notice EXGBoost.compile/1 returns an anonymous function. This is a prediction function that can be JIT-compiled or wrapped in a serving. You can call predict with your data to get predictions in the same way you would use EXGBoost.predict. You can even compute both sets of predictions and check that they are almost the same:

native_preds = EXGBoost.predict(model, x_test)
jit_preds = predict.(x_test)
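
One way to make "almost the same" concrete is to compare the two tensors numerically. The sketch below uses made-up stand-in values so it runs on its own; in practice you would pass the native_preds and jit_preds computed above, assuming both have the same shape. The tolerance of 1.0e-4 is an arbitrary choice for illustration, and the Nx version is an assumption.

```elixir
# Standalone script; assumes Nx can be fetched with Mix.install.
Mix.install([{:nx, "~> 0.7"}])

# Stand-in values for illustration only, not real model output.
native_preds = Nx.tensor([[0.98, 0.01, 0.01], [0.02, 0.95, 0.03]])
jit_preds = Nx.tensor([[0.980001, 0.009999, 0.01], [0.02, 0.95, 0.03]])

# Element-wise comparison within an absolute tolerance.
close? = Nx.to_number(Nx.all_close(native_preds, jit_preds, atol: 1.0e-4)) == 1

# Largest absolute difference between the two sets of predictions.
max_abs_diff =
  native_preds
  |> Nx.subtract(jit_preds)
  |> Nx.abs()
  |> Nx.reduce_max()
  |> Nx.to_number()

# Whether both models pick the same class for every row.
same_class? =
  Nx.argmax(native_preds, axis: -1)
  |> Nx.equal(Nx.argmax(jit_preds, axis: -1))
  |> Nx.all()
  |> Nx.to_number()
  |> Kernel.==(1)
```

Checking the predicted class as well as the raw values is useful because small numerical differences rarely change which class wins.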

Some error between the two is expected, though it is relatively small. In most cases, the predictions from a Mockingjay-compiled model will be identical to those of the original EXGBoost model.

There are more advanced uses of Mockingjay that let you customize the compilation strategy to optimize your compiled model; however, most users can get away with just letting Mockingjay do the work for them. With just a single line of code, you can unlock significant performance boosts and greatly simplify your machine-learning stack.

Conclusion

I really want to emphasize just how awesome this work by Andres is. Thanks to Mockingjay, users can build applications on a familiar foundation. Our reliance on NIFs remains minimal. And users get significant performance boosts to boot. The Nx ecosystem presents the most unified machine-learning experience, from training to production, of any ecosystem out there. Mockingjay is just another example of this experience.

