Run AI model with multiple threads
What we are going to do
In this tutorial, we will run the optimized model using multiple threads. Optimium currently optimizes a model for a pre-defined number of threads (a more flexible approach is under development). With this setup, you can execute your AI model concurrently with the specified number of threads.
We assume that you are already familiar with optimizing your AI model using Optimium, and that Optimium, Optimium Runtime, and the remote settings are already installed. If not, please refer to the previous tutorials.
Concepts
Optimium lets you specify the number of active threads per model. Additional threads may be created for auxiliary tasks such as logging, but these extra threads remain idle most of the time.
(The total number of created threads is the user-specified thread count per model plus 2.)
For optimal performance, avoid setting more threads than the number of cores on your device.
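As a quick sanity check, the sketch below (a minimal illustration using only the Python standard library; the plus-2 overhead figure comes from the note above) computes the total thread count and warns if the request exceeds the available cores:
import os

num_threads = 4                   # user-specified threads for the model
total_threads = num_threads + 2   # plus 2 auxiliary threads (e.g., logging)

cores = os.cpu_count() or 1
if num_threads > cores:
    print(f"Warning: requested {num_threads} threads but only {cores} cores are available")
print(f"Total threads created: {total_threads}")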
Optimize
Setting the number of threads during the optimization step is straightforward. The process is the same as for single-thread optimization, except for configuring the user_arguments.json file.
Run Python and enter the following commands to create the JSON file:
import optimium
optimium.create_args_template()
During the process, you will be prompted to specify the number of threads:
Press "your desired number of threads" when prompted with "Specify number of threads to use during inference (minimum = 1)"
Or, if you have already created a user_arguments.json file, you can directly edit the num_threads field to set your desired thread count.
{
  ...
  "runtime": {
    "num_threads": <set the desired number of threads>  (default: 1)
  },
  ...
}
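If you prefer to script this change, the following sketch (an illustrative snippet that assumes user_arguments.json is in the current directory and contains the "runtime" section shown above) updates the num_threads field with Python's standard json module:
import json

# assumes user_arguments.json is in the current directory
with open("user_arguments.json") as f:
    args = json.load(f)

args["runtime"]["num_threads"] = 4  # desired thread count

with open("user_arguments.json", "w") as f:
    json.dump(args, f, indent=2)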
Deploy & test
When deploying the optimized model, ensure that the runtime API is configured correctly for multi-threading.
Optimium Runtime provides APIs for three languages: Python, C/C++, and Kotlin. Below, we show how to modify the code for each language to enable multi-threading. In this tutorial, we assume you are using 4 threads (you can change this value as desired).
We assume that you are already familiar with using Optimium Runtime for single-threaded execution.
Python
For single-thread execution, we loaded the model as follows:
ctx = rt.Context()
model = ctx.load_model("path/to/model")
For multi-thread execution, the code should be as follows:
threads = 4
ctx = rt.Context()
devices = [
    rt.Device.native(ctx),
    rt.Device.xnnpack(ctx)
]
model = ctx.load_model("path/to/model", devices,
                       threads_count=threads)
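Here the devices list selects the backends the runtime may dispatch work to (the native CPU backend and XNNPACK in this example), and threads_count caps the number of worker threads. The value you pass should match the num_threads you set during optimization, since the model was optimized for that specific thread count.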
C++
For single-thread execution, the model was loaded as follows:
rt::Result<rt::Context> context = rt::Context::create();
rt::Result<rt::Model> model = context.loadModel("path/to/model");
For multi-thread execution, the code should be as follows:
int threads = 4;
rt::Context context = rt::Context::create();
std::vector<rt::Device> devices;
if (rt::Result<rt::Device> dev = rt::Device::native(context)) {
    devices.emplace_back(dev.value());
} else {
    // handle error
}
if (rt::Result<rt::Device> dev = rt::Device::xnnpack(context)) {
    devices.emplace_back(dev.value());
} else {
    // handle error
}
rt::ModelOptions options;
options.ThreadsCount = threads;
auto model = context.loadModel("path/to/model", rt::ArrayRef(devices), options);
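Note that device creation returns an rt::Result, so each backend is checked and unwrapped with value() before being added to the device list; replace the // handle error branches with whatever error handling fits your application.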
Kotlin
For single-thread execution, the model was loaded in the following file: src/main/kotlin/com/enerzai/optimium/example/android/InferenceViewModel.kt (see the Run face detection in Android with Kotlin tutorial).
Single-thread example:
import java.io.File
val ctx = ContextFactory().build()
val model = ctx.loadModel(File("path/to/model"))
For multi-thread execution, the code should be as follows:
import java.io.File
val ctx = ContextFactory().build()
val threads = 4
val devices = listOf(
    native(ctx),
    xnnpack(ctx)
)
val model = ctx.loadModel(File("path/to/model"), devices,
                          threadsCount = threads)
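In the Android example, this change goes in the same InferenceViewModel.kt file referenced above, where the model is loaded.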