Tutorial

To run the examples below, you need CMake 3.21 or above (if you decide to use CMake) and a C++ compiler that supports C++17 (tested on GCC 8 and Clang 10 or above).

C++

1. Install prerequisites.


# debian-based distros
sudo apt-get install build-essential cmake ninja-build

2. Install Optimium Runtime. Please click here to install the runtime.

3. Add Optimium Runtime as a dependency.

find_package(Optimium-Runtime REQUIRED)
target_link_libraries(MyExecutable PRIVATE Optimium::Runtime)

# Use C++17
set(CMAKE_CXX_STANDARD 17)
pkg-config is supported for non-CMake users.
You can get compiler options via `pkg-config --libs --cflags optimium-runtime`.
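
For example, a minimal sketch of a non-CMake build, assuming your sources live in a single main.cpp and the optimium-runtime package is visible to pkg-config:

# main.cpp and MyExecutable are placeholders; substitute your own sources and output name
g++ -std=c++17 main.cpp $(pkg-config --cflags --libs optimium-runtime) -o MyExecutable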

Optimium Runtime requires C++17 to compile correctly. Use set(CMAKE_CXX_STANDARD 17) to set the C++ language version globally, or set_target_properties(<TARGET> PROPERTIES CXX_STANDARD 17) to enable C++17 only for your CMake target.
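
Putting the pieces together, a minimal CMakeLists.txt sketch could look like the following (the project and target names are placeholders):

cmake_minimum_required(VERSION 3.21)
project(MyApp CXX)

# Optimium Runtime requires C++17
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

add_executable(MyExecutable main.cpp)

find_package(Optimium-Runtime REQUIRED)
target_link_libraries(MyExecutable PRIVATE Optimium::Runtime)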

IMPORTANT! If you're using Android, please refer to the code below.

set(CMAKE_FIND_ROOT_PATH_MODE_PACKAGE BOTH)
find_package(Optimium-Runtime REQUIRED)
target_link_libraries(MyExecutable PRIVATE Optimium::Runtime)

In addition, if you're using Android, you must add android:extractNativeLibs="true" to your AndroidManifest.xml file.

 <application ...
              android:extractNativeLibs="true"
              ...>

4. Create a context.

Context is a manager object responsible for managing devices, remote connections, model loading, and so on.

Before loading a model, you should create a context.

You can modify the verbosity level and output path of the logger.

To apply these changes, use the LogSettings class before calling the Context::create() function.

#include <Optimium/Runtime.h>
#include <Optimium/Runtime/Utils/StreamHelper.h> // for logging purpose.
#include <Optimium/Runtime/Logging/LogSettings.h>

int main(...) {
    // change verbosity to debug level
    rt::LogSettings::setLogLevel(rt::LogLevel::Debug);

    // add console log writer. this is no-op on Android devices.
    rt::LogSettings::addWriter(rt::WriterOption::ConsoleWriter());

    // add file log writer to "output.log" file.
    rt::LogSettings::addWriter(rt::WriterOption::FileWriter("output.log"));

    // add Android Logcat log writer. this is no-op on non-Android devices.
    rt::LogSettings::addWriter(rt::WriterOption::AndroidWriter());

    rt::Result<rt::Context> MaybeContext = rt::Context::create();
    if (!MaybeContext.ok()) {
        std::cout << "error: " << MaybeContext.error()
                  << std::endl;
        return 1;
    }

    rt::Context Context = MaybeContext.value();

    // ...
}

❗️

If a value returned by Optimium Runtime has the type Result<...>, it must be checked before use.

Error checks are omitted in these docs for simplicity, but users must perform them.

It is recommended to check available devices before loading a model.

DeviceNotFoundError is a common error when you load a model without checking whether or not the required device is present.

You can get the list of available devices from the Context.getAvailableDevices() function.

int main(...) {
    // ...

    // Iterate over every device to check that the required device exists.
    bool Found = false;
    for (const rt::Device &Dev : Context.getAvailableDevices()) {
        if (Dev.getPlatform() == rt::PlatformKind::Native) {
            Found = true;
            break;
        }
    }

    if (!Found) {
        std::cout << "error: cannot find needed device."
                  << std::endl;
    }

    // ...or you can try device functions to test the device exists.
    rt::Result<rt::Device> MaybeDevice = rt::Device::native(Context);
    if (!MaybeDevice.ok()) {
        std::cout << "error: cannot find needed device: "
                  << MaybeDevice.error() << std::endl;
    }
}

5. (Optional) Connect to remote host

Optimium supports running inference on a remote host. However, before connecting to the remote host, you should install and run the optimium-remote-server CLI application first.

Please click here to install the remote server.

After installing and launching optimium-remote-server, follow this code to connect to the remote server.

RemoteContext represents a context of the remote host. Its role may look like Context, but RemoteContext only supports enumerating and creating devices.

int main(...) {
    // ...

    // connect to the remote host.
    // host address can be an IP address or a domain name.
    // note that direct connection is the only supported connection method.
    rt::RemoteContext RemoteContext =
        Context.connectRemote("your-remote-address",
                              rt::ConnectionMethod::Direct).value();

    // ...or try this if you changed the port configuration.
    rt::RemoteContext RemoteContext =
        Context.connectRemote("your-remote-address",
                              rt::ConnectionMethod::Direct,
                              YOUR_PORT).value();

    // enumerating devices is identical to Context.
    for (const rt::Device &Dev : RemoteContext.getAvailableDevices()) {
        std::cout << Dev << std::endl;
    }
}

ℹ️

You can run inference using both local and remote devices simultaneously.

However, communicating between devices over the network is a heavy operation. Therefore, it is not recommended for production applications to run a model across local and remote devices, or across remote devices on different hosts.

6. Load a model

Model represents an ML model. You can load a model via the Context.loadModel() function.

Optimium Runtime automatically searches for the devices described in the model, but this is limited to local devices.

To choose which device runs the model, or to use a remote device, you should specify the device manually.

You can configure various threading-related settings (thread count, nice value, ...) through the ModelOptions object.
(These settings apply only when running inference on the specified model.)

📘

Unlike other AI inference engines such as TensorFlow Lite, Optimium uses a folder as its model format. Since the folder itself is considered the model, pass the path to that folder when loading it with Optimium Runtime, and always copy the model together with its folder.

int main(...) {
    // ...

    // load a model with auto-detected devices.
    rt::Model Model = Context.loadModel("path/to/model").value();

    // load a model with manually specified devices.
    rt::Device Devices[2] = {
        // Use device for XNNPACK
        rt::Device::xnnpack(Context).value(),

        // Use cpu device at remote host
        rt::Device::native(RemoteContext).value()
    };


    rt::ModelOptions Opt;
    std::vector<uint32_t> Cores {0, 1, 2, 3};
    // set the number of threads to be used for running the model.
    Opt.ThreadsCount = 4;
    // set which cores the threads used for inference will be assigned to.
    Opt.Cores.assign(Cores.begin(), Cores.end());
    // set the priority for threads running the model.
    Opt.Nice = -19;
  
    rt::Model Model =
        Context.loadModel("path/to/model",
                          rt::ArrayRef(Devices), Opt).value();

    // loadModel() also accepts single device
    rt::Model Model =
        Context.loadModel("path/to/model",
                          rt::Device::native(Context), Opt).value();

    // ..
}

7. Listing model information

You can find information about the model using some informative functions.

To get a list of input or output tensors, use Model.getInputTensorsInfo() function for input and Model.getOutputTensorsInfo() function for output.

Functions are also provided if you want to get information about a single input or output tensor by its index or name.

int main(...) {
    // ...

    // get list of input tensor info
    std::cout << "input tensors" << std::endl;
    for (const rt::TensorInfo &Info : Model.getInputTensorsInfo())
        std::cout << Info << std::endl;

    // get list of output tensor info
    std::cout << "output tensors" << std::endl;
    for (const rt::TensorInfo &Info : Model.getOutputTensorsInfo())
        std::cout << Info << std::endl;

    // printing input tensor info
    std::cout << Model.getInputTensorInfo(0).value()
              << std::endl;
    std::cout << Model.getInputTensorInfo("input_0").value()
              << std::endl;

    // printing output tensor info
    std::cout << Model.getOutputTensorInfo(0).value()
              << std::endl;
    std::cout << Model.getOutputTensorInfo("output").value()
              << std::endl;
}

TensorInfo, the return value of functions above, represents information about each tensor.

It contains the tensor's name, type, shape, and alignment.

The quantization scheme is also included if the tensor is quantized.

You can access member variables to get details of the tensor.

int main(...) {
    // ...

    rt::TensorInfo Info = Model.getInputTensorInfo("input_0").value();

    std::cout << "name of tensor: "
              << Info.TensorName << std::endl;
    std::cout << "shape of tensor: "
              << Info.TensorShape << std::endl;
    std::cout << "alignment of tensor: "
              << Info.Alignment << std::endl;
    std::cout << "type of tensor: "
              << Info.TensorType << std::endl;
    std::cout << "name of tensor: "
              << Info.TensorName << std::endl;

    if (Info.Scheme)
        std::cout << "scheme of tensor: "
                  << *(Info.Scheme) << std::endl;

    // ...
}

8. Creating a request

InferRequest represents a single inference that the model runs. Users can create multiple InferRequests and execute the same model without interfering with other requests.

Additionally, users can achieve target throughput by queueing multiple requests efficiently.

ℹ️

Running multiple requests simultaneously is currently not available. It will be added in the near future.

Creating a request is done by calling the Model.createRequest() function.

int main(...) {
    // ...

    rt::InferRequest Request = Model.createRequest().value();

    // ...
}
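
As a sketch of the queueing pattern described above, you can create several requests from the same model and run them back to back (simultaneous execution is not yet available; infer() and wait() are covered in step 10):

int main(...) {
    // ...

    // create two independent requests for the same model.
    rt::InferRequest RequestA = Model.createRequest().value();
    rt::InferRequest RequestB = Model.createRequest().value();

    // run them one after another; each request keeps its own tensors.
    RequestA.infer();
    RequestA.wait();

    RequestB.infer();
    RequestB.wait();

    // ...
}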

9. Getting a tensor

You can get a tensor from the getInputTensor() and getOutputTensor() functions.

You can use the copyFrom() and copyTo() functions to copy data between the tensor and a data buffer.

int main(...) {
    // ...

    constexpr size_t kInput0Size = ...;
    constexpr size_t kOutputSize = ...;

    // Assume that input and output buffers are prepared.
    float* InputBuffer = ...;
    float* OutputBuffer = ...;

    // Get input tensor and put data
    rt::Tensor Input0 = Request.getInputTensor("input_0").value();
    Input0.copyFrom(InputBuffer, kInput0Size);

    // Get output tensor and get data
    rt::Tensor Output = Request.getOutputTensor("output").value();
    Output.copyTo(OutputBuffer, kOutputSize);
}

Optimium Runtime checks that the type of the input data provided by the user is compatible with the type of the tensor. If they do not match, the call is rejected with Status::TypeMismatch.

Wrapper types are defined for tensor element types that are not natively supported by C++; they are located in Optimium/Runtime/Types.

Tensor element types and their wrapped C++ types are listed below.

Element Type       C++ Type
ElementType.F16    rt::float16
ElementType.BF16   rt::bfloat16
ElementType.TF32   rt::tfloat32
ElementType.QS8    rt::qsfloat8
ElementType.QU8    rt::qufloat8
ElementType.QS16   rt::qsfloat16
ElementType.QU16   rt::qufloat16

ℹ️

Optimium Runtime only recognizes these C++ types for the corresponding tensor types when it checks a tensor's data type. Other data types are not recognized and result in a compilation error.
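
For example, a sketch of filling a half-precision tensor with the wrapper type (the tensor name "input_f16" and its size are hypothetical):

int main(...) {
    // ...

    constexpr size_t kHalfSize = ...;

    // hypothetical F16 input tensor; rt::float16 is used so the runtime's type check passes.
    std::vector<rt::float16> HalfData(kHalfSize);

    rt::Tensor InputF16 = Request.getInputTensor("input_f16").value();
    InputF16.copyFrom(HalfData.data(), HalfData.size());

    // ...
}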

If you want to avoid the copy cost, you can directly access the device memory that backs the tensor.

int main(...) {
    // ...

    rt::Tensor Tensor = Request.getInputTensor("input_0").value();
    {
        rt::BufferHolder Buffer = Tensor.getRawBuffer();
        void *Ptr = Buffer.data();

        // do some direct write
    }

    // ...
}

❗️

This backing buffer does not perform any checks (type checks, range checks, etc.). Use this feature at your own risk.

ℹ️

BufferHolder, the Buffer variable in the code above, is a RAII (Resource Acquisition Is Initialization) object that holds the memory only within its scope. Therefore, referencing this buffer from a different scope should be avoided.

10. Running an inference

Running an inference needs just two lines of code: infer and wait.

The Request.infer() function asks the runtime to start the inference and returns immediately, and the Request.wait() function waits for the previously requested inference to finish (regardless of failure).

The Request.wait() function takes a timeout argument, the number of milliseconds to wait. If the method returns false, the inference finished before the timeout was reached; if it returns true, the inference had not finished before the timeout was reached.

Note that the Request.wait() method should always be called so that errors that occurred during inference are checked. Starting an inference while the request is in a fault state may cause undefined behavior.

To check the state of the request, use Request.getStatus() function.

int main(...) {
    // ...

    // waits until inference is finished.
    Request.infer();
    Request.wait();

    // waits 500 milliseconds to finish inference
    using namespace std::chrono_literals;

    Request.infer();

    if (Request.wait(500ms).value())
        std::cout << "inference not finished after 500ms"
                  << std::endl;
    else
        std::cout << "inference was finished within 500ms"
                  << std::endl;

    // check the state of the request
    std::cout << "status of request: "
              << Request.getStatus()
              << std::endl;
}
Python

❗️

Do not put any Optimium Runtime related objects in the global scope or create circular references to them.

This can lead to memory leaks or undefined behavior due to differences between the memory management models of C++ and Python.
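
For example, a minimal sketch of the recommended structure, keeping all runtime objects local to a function so they are released when the function returns (the individual calls are explained in the steps below):

import optimium.runtime as rt

def run():
    # runtime objects stay local to this function; nothing is kept in module globals.
    context = rt.Context()
    model = context.load_model("path/to/model")
    request = model.create_request()
    # ... run inference ...

if __name__ == "__main__":
    run()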

1. Install Optimium Runtime. Please click here to install the runtime.

2. Create a context

Context is a manager object responsible for managing devices, remote connections, model loading, and so on.

Before loading a model, you should create a context.

You can modify the verbosity level and output path of the logger.

import optimium.runtime as rt

def main():
    context = rt.Context(
        # change verbosity to debug level.
        verbosity=rt.LogLevel.Debug,
        # change log output from stdout to "output.log" file.
        # default is writing log at stdout.
        log_path="output.log"
    )

It is recommended to check available devices before loading a model.

DeviceNotFoundError is a common error when you load a model without checking whether or not the required device is present.

You can get a list of available devices from the context.available_devices property.

def main():
    # ...

    # Iterate over every device to check that the required device exists.
    found = False
    for dev in context.available_devices:
        if dev.platform == rt.PlatformKind.Native:
            found = True

    if not found:
        print("cannot find needed device")

    # ...or you can try device functions to test the device exists.
    try:
        rt.Device.native(context)
    except rt.DeviceNotFoundError as ex:
        print(f"cannot find needed device: {ex}")

3. (Optional) Connect to remote host

Optimium supports running inference on a remote host. However, before connecting to the remote host, you should install and run the optimium-remote-server CLI application first.

Please refer here to install the remote server.

After installing and launching optimium-remote-server, follow this code to connect to the remote server.

RemoteContext represents a context of the remote host. Its role may look like Context, but RemoteContext only supports enumerating and creating devices.

def main():
    # ...

    # connect to the remote host.
    # host address can be IP address or domain.
    remote_context = context.connect_remote("your-remote-address")

    # ...or try this if you changed port configuration
    remote_context = context.connect_remote("your-remote-address", port=YOUR_PORT)

    # enumerating devices is identical to Context. Same property, same device function.
    for dev in remote_context.available_devices:
        print(dev)

    remote_native = rt.Device.native(remote_context)

ℹ️

You can run inference using both local and remote devices simultaneously.

However, communicating between devices over a network is costly. Therefore, it is not recommended for production applications to run a model across local and remote devices, or across remote devices on different hosts.

4. Load a model

Model represents an ML model. You can load a model via the context.load_model() method.

Optimium Runtime automatically finds and uses the devices described in the model, but this is limited to local devices. To choose which device runs the model, or to use a remote device, you should specify the device manually.

You can configure threading-related settings by passing them as parameters to the context.load_model() function.
(These settings apply only when running inference on the specified model.)

ℹ️

Unlike other AI inference engines such as TensorFlow Lite, Optimium uses a folder as its model format. Since the folder itself is considered the model, pass the path to that folder when loading it with Optimium Runtime, and always copy the model together with its folder.

def main():
    # ...

    # load a model with auto-detected devices.
    model = context.load_model("path/to/model")

    # load a model with manually specified devices.
    devices = [
        # Use device for XNNPACK
        rt.Device.xnnpack(context),

        # Use cpu device at remote host
        rt.Device.native(remote_contex),
    ]

    model = context.load_model("path/to/model", devices
                              # set the number of threads to be used for running the model.
                              threads_count = 4,
                              # set which cores the threads used for inference will be assigned to.
                              cores = [0, 1, 2, 3],
                              # set the priority for threads running the model.
                              nice = -19)

5. Listing model information

You can find information about the model using some informative methods.

To get a list of input or output tensors, use model.input_tensors_info property for input and model.output_tensors_info property for output.

Methods are also provided if you want to get information about a single input or output tensor by its index or name.

def main():
    # ...

    print("input tensors")
    for info in model.input_tensors_info:
        print(info)

    print("output tensors")
    for info in model.output_tensors_info:
        print(info)

    # printing input tensor info
    print(model.get_input_tensor_info(0))
    print(model.get_input_tensor_info("input_0"))

    # printing output tensor info
    print(model.get_output_tensor_info(0))
    print(model.get_output_tensor_info("output"))

TensorInfo, the return value of the properties and methods above, represents information about each tensor. It contains the tensor's name, type, shape, and alignment. The quantization scheme is also included if the tensor is quantized.

You can access properties to get details of the tensor.

def main():
    # ...

    info = model.get_input_tensor_info("input_0")

    print(f"name of tensor: {info.name}")
    print(f"shape of tensor: {info.shape}")
    print(f"alignment of tensor: {info.alignment}")
    print(f"type of tensor: {info.type}")

    if info.scheme:
        print(f"quantization scheme of tensor: {info.scheme}")

6. Creating a request

InferRequest represents a single inference that the model runs. Users can create multiple InferRequests and execute the same model without interfering with other requests.

Additionally, users can achieve target throughput by queueing multiple requests efficiently.

ℹ️

Running multiple requests simultaneously is currently not available. It will be added in the near future.

Creating a request is done by calling the model.create_request() method.

def main():
    # ...

    request = model.create_request()
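
As a sketch of the queueing pattern mentioned above, several requests can be created from the same model and run back to back (simultaneous execution is not yet available; infer() and wait() are covered in step 8):

def main():
    # ...

    # create two independent requests for the same model.
    request_a = model.create_request()
    request_b = model.create_request()

    # run them one after another; each request keeps its own tensors.
    request_a.infer()
    request_a.wait()

    request_b.infer()
    request_b.wait()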

7. Getting a tensor

Optimium Runtime for Python supports NumPy's np.ndarray natively. You can read and write tensor data as NumPy's ndarray.

request.set_inputs() accepts a single np.ndarray, a sequence of np.ndarray, or a dictionary mapping string keys to np.ndarray, and request.get_outputs() returns a list of np.ndarray.

# ...
import numpy as np

def main():
    # ...

    input_0 = np.ones((1, 2, 3), dtype=np.float32)
    input_1 = np.ones((1, 2, 3), dtype=np.float32)

    # this form is allowed when the model has single input.
    request.set_inputs(input_0)

    # use list or tuple to put multiple data at once.
    request.set_inputs([input_0, input_1])

    # dictionary is also supported.
    request.set_inputs({
        "input_0": input_0,
        "input_1": input_1
    })

    # to get outputs, use get_outputs()
    outputs = request.get_outputs()

You can also access the request's Tensor objects to read and write tensor data.

def main():
    # ...

    # get tensor by name
    input_0 = request.get_input_tensor("input_0")

    # get tensor by index
    input_1 = request.get_input_tensor(1)

    # get data from the tensor
    input_0_data = input_0.to_numpy()

    # put data into the tensor
    input_1.copy_from(np.ones((1, 2, 3), dtype=input_1.type.to_dtype()))

8. Running an inference

Running an inference is just two lines of code: infer and wait.

The request.infer() method asks the runtime to start the inference and returns immediately, and the request.wait() method waits for the previously requested inference to finish (regardless of failure).

The request.wait() method takes a timeout argument, the number of milliseconds to wait. If the method returns False, the inference finished before the timeout was reached; if it returns True, the inference had not finished before the timeout was reached.

Note that the request.wait() method should always be called so that errors that occurred during inference are checked. Starting an inference while the request is in a fault state may cause undefined behavior.

To check the state of the request, use request.status property.

def main():
    # ...

    # waits until inference is finished.
    request.infer()
    request.wait()

    # waits 500 milliseconds to finish inference
    request.infer()

    if request.wait(500):
        print("inference not finished after 500ms")
    else:
        print("inference was finsihed within 500ms")

    # check status of the inference.
    print(f"current state of request: {request.status}")
    request.wait()
Kotlin

1. Import Optimium Runtime to your project.

Click here to see how to import Optimium Runtime into your project.

If you're using Android, you must add android:extractNativeLibs="true" to your AndroidManifest.xml file.

 <application ...
              android:extractNativeLibs="true"
              ...>

2. Create a context

Context is a manager object responsible for managing devices, remote connections, model loading, and so on.

Before loading a model, you should create a context.

Creating a context is done via ContextFactory. You can change the verbosity level and output path of the logger.

// ...
import com.enerzai.optimium.runtime.Context
import com.enerzai.optimium.runtime.ContextFactory

fun main(...) {
    val factory = ContextFactory()

    // change verbosity to debug level.
    factory.verbosity(LogLevel.DEBUG)

    // add file log writer to "output.log" file.
    factory.enableFileLog(File("output.log"))

    // add console log writer. note that this is no-op on Android devices.
    factory.enableConsole()

    // add Android logcat log writer. note that this is no-op on non-Android devices.
    factory.enableLogcat()

    factory.create().use { context ->
        // ...
    }

    // ...or use traditional way.
    val context = factory.create()

    // do something important

    // you should close after use.
    context.close()
}

ℹ️

Users should call the close() method on Context, Model, and InferRequest objects after use. You can also use Java's try-with-resources statement or Kotlin's use() function.

It is recommended to check available devices before loading a model.

DeviceNotFoundException is a common error when you load a model without checking whether or not the required device is present.

You can get the list of available devices from the context.availableDevices property.

// ...
import com.enerzai.optimium.runtime.Devices.native
import com.enerzai.optimium.runtime.PlatformKind
import com.enerzai.optimium.runtime.exceptions.DeviceNotFoundException

fun main(...) {
    // ...

    // Iterate over every device to check that the required device exists.
    val dev = context.availableDevices.find {
        it.platform == PlatformKind.NATIVE
    }
    if (dev == null) {
        println("cannot find needed device")
    }

    // ...or you can try device functions to test the device exists.
    val nativeDev = try {
        native(context)
    } catch (ex: DeviceNotFoundException) {
        println("cannot find needed device: $ex")
        null
    }
}

3. Connect to remote host (Optional)

Optimium supports running inference on a remote host. However, before connecting to the remote host, you should install and run the optimium-remote-server CLI application first.

Please refer here to install the remote server.

After installing and launching optimium-remote-server, follow this code to connect to the remote server.

RemoteContext represents a context of the remote host. Its role may look like Context, but RemoteContext only supports enumerating and creating devices.

// ...
import com.enerzai.optimium.runtime.RemoteContext

fun main(...) {
    // ...

    // connect to the remote host.
    // host address can be IP address or domain.
    val remoteContext =
        context.connectRemote("your-remote-address")

    // ...or try this if you changed port configuration.
    val remoteContext =
        context.connectRemote("your-remote-address",
                              port = YOUR_PORT)

    // enumerating devices is identical to Context. Same property, same device function.
    remoteContext.availableDevices.forEach { println("$it") }

    val remoteNative = native(remoteContext)
}

ℹ️

You can run inference using local and remote devices simultaneously.

However, communicating between devices over the network is a heavy operation. Therefore, it is not recommended for production applications to run a model across local and remote devices, or across remote devices on different hosts.

4. Load a model

Model represents an ML model. You can load a model via the context.loadModel() method.

Optimium Runtime automatically finds and uses the devices described in the model, but this is limited to local devices. To choose which device runs the model, or to use a remote device, you should specify the device manually.

You can configure threading-related settings by passing them as parameters to the context.loadModel() function. (These settings apply only when running inference on the specified model.)

ℹ️

Unlike other AI inference engines such as TensorFlow Lite, Optimium uses a folder as its model format. Since the folder itself is considered the model, pass the path to that folder when loading it with Optimium Runtime, and always copy the model together with its folder.

// ...
import com.enerzai.optimium.runtime.Model
import com.enerzai.optimium.runtime.Devices.xnnpack

import java.io.File

fun main(...) {
    // ...

    // load a model with auto-detected devices.
    context.loadModel(File("path/to/model")).use { model ->
        // do something useful
    }

    // load a model with manually specified devices.
    val devices = listOf(
        // Use device for XNNPACK
        xnnpack(context),

        // Use cpu device at remote host
        native(remoteContext)
    )

    context.loadModel(File("path/to/model"), devices, 
                      // set the number of threads to be used for running the model.
                      threadsCount = 4,
                      // set which cores the threads used for inference will be assigned to.
                      cores = listOf(0, 1, 2, 3),
                      // set the priority for threads running the model.
                      nice = -19
                     ).use { model ->
        // do something useful
    }
}

5. Listing model information

You can find information about the model using some informative methods.

To get a list of input or output tensors, use model.inputTensorsInfo property for input and model.outputTensorsInfo property for output.

Methods are also provided if you want to get information about a single input or output tensor by its index or name.

// ...
import com.enerzai.optimium.runtime.TensorInfo

fun main(...) {
    // ...

    println("input tensors")
    model.inputTensorsInfo.forEach {
        println("$it")
    }

    println("output tensors")
    model.outputTensorsInfo.forEach {
        println("$it")
    }

    // printing input tensor info
    println("${model.getInputTensorInfo(0)}")
    println("${model.getInputTensorInfo("input_0")}")

    // printing output tensor info
    println("${model.getOutputTensorInfo(0)}")
    println("${model.getOutputTensorInfo("input_0")}")
}

TensorInfo, the return value of the properties and methods above, represents information about each tensor. It contains the tensor's name, type, shape, and alignment. The quantization scheme is also included if the tensor is quantized.

You can access properties to get details of the tensor.

fun main(...) {
    // ...

    val info = model.getInputTensorInfo("input_0")

    with(info) {
        println("name of tensor: $name")
        println("shape of tensor: $shape")
        println("alignment of tensor: $alignment")
        println("type of tensor: $type")

        if (scheme != null) {
            println("scheme of tensor: $scheme")
        }
    }
}

6. Creating a request

InferRequest represents a single inference that the model runs. Users can create multiple InferRequests and execute the same model without interfering with other requests.

Additionally, users can achieve target throughput by queueing multiple requests efficiently.

ℹ️

Running multiple requests simultaneously is currently not available. It will be added in the near future.

Creating a request is done by calling the model.createRequest() method.

// ...
import com.enerzai.optimium.runtime.InferRequest

fun main(...) {
    // ...

    model.createRequest().use { request ->
        // do something useful
    }
}
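
Likewise, a sketch of the queueing pattern mentioned above, creating several requests from the same model and running them back to back (simultaneous execution is not yet available; infer() and waitForFinish() are covered in step 8):

fun main(...) {
    // ...

    // create two independent requests for the same model; each is closed by use().
    model.createRequest().use { requestA ->
        model.createRequest().use { requestB ->
            // run them one after another; each request keeps its own tensors.
            requestA.infer()
            requestA.waitForFinish()

            requestB.infer()
            requestB.waitForFinish()
        }
    }
}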

7. Getting a tensor

Optimium Runtime for Kotlin supports primitive arrays and java.nio Buffer classes for reading and writing tensor data.

// ...
import java.nio.FloatBuffer

fun main(...) {
    // ...
    val buffer = FloatBuffer.allocate(...)

    request.setInput("input_0", buffer)

    // indexes are also supported
    request.setInput(0, buffer)

    val array = FloatArray(...)

    // get output data
    request.getOutput("output", array)
}

You can also access the request's Tensor objects to read and write tensor data.

// ...
import com.enerzai.optimium.runtime.Tensor

fun main(...) {
    // ...

    val input0 = request.getInputTensor("input_0")

    val array = FloatArray(...)

    // Put data into the tensor
    input0.copyFrom(array, offset = ...)

    // Get data from the tensor
    input0.copyTo(array, offset = ...)
}

Due to limitations of Java, types that cannot be expressed in Java are mapped to other types of the same size.

Accepted Java types for the tensor are listed below:

Java Type              Element Type
Byte (ByteBuffer)      Allowed for any type
Short (ShortBuffer)    I16, U16, F16, QS16, QU16
Int (IntBuffer)        I32, U32
Long (LongBuffer)      I64, U64
Float (FloatBuffer)    F32
Double (DoubleBuffer)  F64
Boolean                BOOL

❗️

Please be aware that the Java type and the actual tensor type can differ. Check your tensor type before using its values in Java.

ℹ️

A Buffer for ElementType.BOOL is not supported because Java does not provide an equivalent buffer type.
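
For example, a sketch of checking the element type before choosing a buffer type (the ElementType constant shown here is an assumption based on the table above):

// ...
import java.nio.ShortBuffer

fun main(...) {
    // ...

    val info = model.getInputTensorInfo("input_0")

    // assumed enum constant; F16 has no direct Java equivalent, so a ShortBuffer
    // of the same element size is used instead (see the table above).
    if (info.type == ElementType.F16) {
        val halfBuffer = ShortBuffer.allocate(...)
        request.setInput("input_0", halfBuffer)
    }
}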

8. Running inference

Running inference needs just two lines of code: infer and waitForFinish.

The request.infer() method asks the runtime to start the inference and returns immediately, and the request.waitForFinish() method waits for the previously requested inference to finish (regardless of failure).

The request.waitForFinish() method takes a timeout argument, the number of milliseconds to wait. If the method returns false, the inference finished before the timeout was reached; if it returns true, the inference had not finished before the timeout was reached.

Note that the request.waitForFinish() method should always be called so that errors that occurred during inference are checked. Starting an inference while the request is in a fault state may cause undefined behavior.

To check the state of the request, use request.status property.

fun main(...) {
    // ...

    // waits until inference is finished.
    request.infer()
    request.waitForFinish()

    // waits 500 milliseconds to finish inference
    request.infer()
    if (request.waitForFinish(500)) {
        println("inference not finished after 500ms")
    } else {
        println("inference was finished within 500ms")
    }

    // check status of the request
    println("current state of request: ${request.status}")
}